by: Cody Hill
date: 1/18/2024
This project is intended as a demonstration and exploration of training machine learning predictive models on Oura Ring exported user data. The ground truth target variable is the sleep score assigned to each day in the exported data. Along with the goal of creating a well-fit model for this problem, both regression and classification models will be used, exploring the efficacy of turning the discrete sleep scores into ordinal categorical variables.
Since the data collection period is currently limited, this project has been made with the goal of handling continued data uploads (all data cleaning/analysis will be generalized for continuous new data).
This project is licensed under the terms of the MIT License, but is intended for private use only.
If you fork or use any part of this project please attribute Cody Hill as the creator of this work.
Utilize these links to more easily navigate through the notebook.
Note: A link will be included at the bottom of each numbered section back to the table of contents.
All data has been exported from my personal Oura Ring containing raw biometric data and Oura calculated data since I began wearing the device.
The Oura Ring tracks and records over 20 biometric signals from the sensors on the inside of the ring throughout the day and during sleep. Along with the raw biometric data, Oura's software derives new metrics that feed into a daily score assigned to categories such as sleep, recovery, readiness, and activity. In total, 89 features can be extracted from a user's account, giving historical data since the beginning of the user's wear time. Most of these features are daily cumulative sums of metrics or other measures that typically have one entry per day, but the sleep data has an entry for every period a user is sleeping, potentially allowing multiple entries per day.
Notable Data Information:
730 rows of sleep data, i.e., 730 recorded sleep events.
Feature Information (more info can be found at https://cloud.ouraring.com/edu/sleep_score):
score) Ranging from 0-100, the sleep score is an overall measure of how well you slept.
awake_time) Awake time is the time spent awake in bed before and after falling asleep.
bedtime_start_delta) Bedtime is an estimate of the time you went to bed with the intention to sleep. Delta measures the difference of your bedtime compared to your regular bedtime (calculates continuously).
deep_sleep_duration) Deep sleep is the most restorative and rejuvenating sleep stage, enabling muscle growth and repair. The amount of deep sleep can vary significantly between nights and individuals. It can make up anywhere between 0-35% of your total sleep time.
light_sleep_duration) Light sleep makes up about 50% of total sleep time for adults, and typically begins a sleep cycle.
rem_sleep_duration) REM (rapid eye movement) sleep is the final sleep stage in a typical sleep cycle. It's associated with dreaming, memory consolidation, learning and creativity.
total_sleep_duration) Total sleep time refers to the total amount of time you spend in light, REM, and deep sleep.
average_breath) Oura tracks the number of breaths you take per minute while you sleep, and shows your nocturnal average respiratory rate.
latency) Sleep latency is the time it takes for you to fall asleep.
average_hrv) When a person is relaxed, a healthy heart's beating rate shows variation in the time interval between heartbeats.
temperature_deviation) Oura measures your body temperature while you sleep. It sets the baseline for your normal temperature during the first couple of weeks, and adjusts it if needed as more data is collected. Variations are shown in relation to your baseline, represented by 0.0.
active_calories) Activity burn shows the kilocalories you've burned by daily movement and exercise.
low_activity_time) Low activity includes activities such as casual walking and light housework both indoors and outdoors.
medium_activity_time) Medium activity includes dynamic activities with an intensity level equivalent to brisk walking.
high_activity_time) High activity includes vigorous activities with an intensity level higher or equivalent to jogging.
sedentary_time) Inactive time includes sitting, standing or otherwise being passive.
import os
import sys
import numpy as np
import pandas as pd
import sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, AdaBoostClassifier
from sklearn.svm import SVC, SVR
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, roc_auc_score, roc_curve, RocCurveDisplay, mean_squared_error, mean_absolute_error, auc, f1_score
from sklearn.preprocessing import StandardScaler, label_binarize, LabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression, Ridge, Lasso, RidgeClassifier
import scipy
import scipy.stats as stats
import statsmodels.api as sm
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize
import seaborn as sns
print(f"Python version: {sys.version}")
packages = [np, pd, sklearn, scipy, sm, matplotlib, sns]
for package in packages:
    print(f"{str(package).partition('from')[0]} using version: {package.__version__}")
Python version: 3.11.7 (main, Dec 4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)] <module 'numpy' using version: 1.26.3 <module 'pandas' using version: 2.1.4 <module 'sklearn' using version: 1.3.2 <module 'scipy' using version: 1.11.4 <module 'statsmodels.api' using version: 0.14.1 <module 'matplotlib' using version: 3.8.2 <module 'seaborn' using version: 0.13.2
As discussed above, we will import the data using a few separate dataframes.
biometric_df for the biometric data that has one entry per day.
sleep_df for all the recorded sleep events, containing potentially multiple entries per day.
daily_sleep_score_df for the daily sleep score, which will be used as the ground truth target variable.
# Import data
current_wdir = os.getcwd()
data_folder = current_wdir + '/Data'
sleep_data_folder = current_wdir + '/Data/Sleep_Data'
# Iterate through each file in ./Data/ and add it to a dataframe.
file_path = [f'{data_folder}/{file}' for file in os.listdir(data_folder) if '.csv' in file]
# Using a list to concat the dfs with index_col makes it easy to merge based on 'day'. More memory usage, but fine for this project.
biometric_df = pd.concat([pd.read_csv(file, index_col = 'day') for file in file_path], join = 'outer', ignore_index = False, axis = 1).reset_index()
# Iterate through each file in ./Data/Sleep_Data and add it to a dataframe.
# Separated sleep data as it potentially has multiple entries per day. Will merge them later.
file_path_sleep = [f'{sleep_data_folder}/{file}' for file in os.listdir(sleep_data_folder) if '.csv' in file and 'daily' not in file]
sleep_df = pd.concat(map(pd.read_csv, file_path_sleep), join = 'outer', ignore_index = False, axis = 1)
# Import ground truth label sleep score.
file_path_daily_sleep_score = [f'{sleep_data_folder}/{file}' for file in os.listdir(sleep_data_folder) if '.csv' in file and 'daily' in file]
daily_sleep_score_df = pd.read_csv(file_path_daily_sleep_score[0])
daily_sleep_score_df = daily_sleep_score_df[['score', 'day']]
Let's take a look at what data types the columns have been given.
biometric_df.dtypes.value_counts()
int64 25 float64 13 object 4 Name: count, dtype: int64
sleep_df.dtypes.value_counts()
float64 33 object 9 int64 5 Name: count, dtype: int64
The majority of these 89 features are numerical, with 13 features showing up as objects, which are typically strings.
Now let's display the dataframes to get a better idea of what these features are. Since 89 columns is too many to display in full, we'll also print out the column names.
# First detailed looks
print('biometric_df info:\n------------------\n', biometric_df.shape)
display(biometric_df.head(2))
print(biometric_df.columns)
print('\n\nsleep_df info:\n------------------\n', sleep_df.shape)
display(sleep_df.head(4))
print(sleep_df.columns)
biometric_df info: ------------------ (369, 42)
| day | active_calories | average_met_minutes | equivalent_walking_distance | high_activity_met_minutes | high_activity_time | inactivity_alerts | low_activity_met_minutes | low_activity_time | medium_activity_met_minutes | ... | temperature_trend_deviation | contributors_activity_balance | contributors_hrv_balance | contributors_previous_day_activity | contributors_previous_night | contributors_recovery_index | contributors_resting_heart_rate | contributors_sleep_balance | contributors_body_temperature | spo2_percentage | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-02-03 | 140 | 1.18750 | 1969 | 0 | 0 | 1 | 86 | 8460 | 8 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2023-02-04 | 650 | 1.59375 | 10728 | 16 | 120 | 0 | 277 | 23220 | 172 | ... | NaN | NaN | NaN | 96.0 | 74.0 | 97.0 | 94.0 | NaN | 90.0 | 98.523 |
2 rows × 42 columns
Index(['day', 'active_calories', 'average_met_minutes',
'equivalent_walking_distance', 'high_activity_met_minutes',
'high_activity_time', 'inactivity_alerts', 'low_activity_met_minutes',
'low_activity_time', 'medium_activity_met_minutes',
'medium_activity_time', 'meters_to_target', 'non_wear_time',
'resting_time', 'sedentary_met_minutes', 'sedentary_time', 'steps',
'target_calories', 'target_meters', 'total_calories', 'score',
'class_5_min', 'contributors_meet_daily_targets',
'contributors_move_every_hour', 'contributors_recovery_time',
'contributors_stay_active', 'contributors_training_frequency',
'contributors_training_volume', 'met_1_min', 'ring_met_1_min', 'score',
'temperature_deviation', 'temperature_trend_deviation',
'contributors_activity_balance', 'contributors_hrv_balance',
'contributors_previous_day_activity', 'contributors_previous_night',
'contributors_recovery_index', 'contributors_resting_heart_rate',
'contributors_sleep_balance', 'contributors_body_temperature',
'spo2_percentage'],
dtype='object')
sleep_df info:
------------------
(730, 47)
| average_breath | average_heart_rate | average_hrv | awake_time | bedtime_end | bedtime_start | day | deep_sleep_duration | efficiency | latency | ... | readiness_contributors_body_temperature | readiness_contributors_hrv_balance | readiness_contributors_previous_day_activity | readiness_contributors_previous_night | readiness_contributors_recovery_index | readiness_contributors_resting_heart_rate | readiness_contributors_sleep_balance | readiness_score | readiness_temperature_deviation | readiness_temperature_trend_deviation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13.625 | 63.25 | 77.0 | 4440.0 | 2023-02-04T07:08:22.000-06:00 | 2023-02-03T22:40:22.000-06:00 | 2023-02-04 | 4650.0 | 85.0 | 990.0 | ... | 90.0 | NaN | 96.0 | 74.0 | 97.0 | 94.0 | NaN | 89.0 | -0.38 | NaN |
| 1 | 15.125 | 92.49 | 19.0 | 6210.0 | 2023-02-05T09:37:28.000-06:00 | 2023-02-04T23:54:28.000-06:00 | 2023-02-05 | 4590.0 | 82.0 | 180.0 | ... | 100.0 | NaN | 82.0 | 79.0 | 100.0 | 59.0 | NaN | 78.0 | -0.04 | 0.15 |
| 2 | 15.125 | 87.67 | NaN | 1920.0 | 2023-02-05T20:10:57.000-06:00 | 2023-02-05T19:36:57.000-06:00 | 2023-02-06 | 30.0 | 6.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 15.125 | 82.50 | 37.0 | 720.0 | 2023-02-05T20:36:02.000-06:00 | 2023-02-05T20:20:02.000-06:00 | 2023-02-06 | 0.0 | 25.0 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 rows × 47 columns
Index(['average_breath', 'average_heart_rate', 'average_hrv', 'awake_time',
'bedtime_end', 'bedtime_start', 'day', 'deep_sleep_duration',
'efficiency', 'latency', 'light_sleep_duration', 'lowest_heart_rate',
'movement_30_sec', 'period', 'rem_sleep_duration', 'restless_periods',
'score', 'segment_state', 'sleep_midpoint', 'time_in_bed',
'total_sleep_duration', 'type', 'sleep_phase_5_min', 'timezone',
'bedtime_start_delta', 'bedtime_end_delta', 'midpoint_at_delta',
'heart_rate_5_min', 'hrv_5_min', 'contributors_total_sleep',
'contributors_deep_sleep', 'contributors_rem_sleep',
'contributors_efficiency', 'contributors_latency',
'contributors_restfulness', 'contributors_timing',
'readiness_contributors_activity_balance',
'readiness_contributors_body_temperature',
'readiness_contributors_hrv_balance',
'readiness_contributors_previous_day_activity',
'readiness_contributors_previous_night',
'readiness_contributors_recovery_index',
'readiness_contributors_resting_heart_rate',
'readiness_contributors_sleep_balance', 'readiness_score',
'readiness_temperature_deviation',
'readiness_temperature_trend_deviation'],
dtype='object')
print('\n\ndaily_sleep_score_df info:\n--------------------------\n',daily_sleep_score_df.shape)
display(daily_sleep_score_df.head(4))
daily_sleep_score_df info: -------------------------- (355, 2)
| score | day | |
|---|---|---|
| 0 | 81 | 2023-02-04 |
| 1 | 81 | 2023-02-05 |
| 2 | 89 | 2023-02-06 |
| 3 | 78 | 2023-02-07 |
# Confirm there are no duplicates of days in the biometric or sleep score data.
print(biometric_df['day'].duplicated().sum())
print(daily_sleep_score_df['day'].duplicated().sum())
0 0
Note: Three dataframes have been initialized separately because of how the data is organized.
For instance,
biometric_df is primarily related to activity and other biometrics.
sleep_df potentially has multiple entries per day, indicating any detected sessions of sleep and their interruptions.
daily_sleep_score_df contains the sleep score, which will be used as the ground truth ($\mathbf{Y}$).
Feature engineering and transformations will be performed on the two former dataframes before they are all combined for use in the models.
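Since all three dataframes share a ['day'] key, the eventual combination can be sketched roughly as below. This is a toy example with a hypothetical column subset, not the notebook's actual merge, which happens later after feature engineering:

```python
import pandas as pd

# Toy stand-ins for the three dataframes, each keyed on 'day'.
biometric = pd.DataFrame({'day': ['2023-02-04', '2023-02-05'], 'steps': [8000, 12000]})
sleep = pd.DataFrame({'day': ['2023-02-04', '2023-02-05'], 'total_sleep_duration': [27000, 25000]})
scores = pd.DataFrame({'day': ['2023-02-04', '2023-02-05'], 'score': [81, 81]})

# Inner merges on 'day' keep only days present in all three sources.
combined = biometric.merge(sleep, on='day').merge(scores, on='day')
print(list(combined.columns))  # ['day', 'steps', 'total_sleep_duration', 'score']
```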
First, we must remove the features whose labels include "contributors", as these columns contain standardized sub-scores that Oura's models aggregate and average into the final score for each category.
# Remove contributor and score columns (multiple scores involved, will add truth label later).
biometric_df = biometric_df.loc[:, ~biometric_df.columns.str.contains('contributors|score')]
sleep_df = sleep_df.loc[:, ~sleep_df.columns.str.contains('contributors|score')]
print(biometric_df.shape)
print(sleep_df.shape)
(369, 26) (730, 30)
We are left with 369 rows in biometric_df, which corresponds to 369 days, and 730 rows in sleep_df, one for each recorded sleep event.
We've removed 33 feature columns containing the word 'contributors' or 'score' as we'll be using the score in daily_sleep_score_df.
Let's take a closer look at the remaining features, identifying those that will be of use in our models or not, and continue further data preparations.
print(biometric_df.columns, sleep_df.columns)
Index(['day', 'active_calories', 'average_met_minutes',
'equivalent_walking_distance', 'high_activity_met_minutes',
'high_activity_time', 'inactivity_alerts', 'low_activity_met_minutes',
'low_activity_time', 'medium_activity_met_minutes',
'medium_activity_time', 'meters_to_target', 'non_wear_time',
'resting_time', 'sedentary_met_minutes', 'sedentary_time', 'steps',
'target_calories', 'target_meters', 'total_calories', 'class_5_min',
'met_1_min', 'ring_met_1_min', 'temperature_deviation',
'temperature_trend_deviation', 'spo2_percentage'],
dtype='object') Index(['average_breath', 'average_heart_rate', 'average_hrv', 'awake_time',
'bedtime_end', 'bedtime_start', 'day', 'deep_sleep_duration',
'efficiency', 'latency', 'light_sleep_duration', 'lowest_heart_rate',
'movement_30_sec', 'period', 'rem_sleep_duration', 'restless_periods',
'segment_state', 'sleep_midpoint', 'time_in_bed',
'total_sleep_duration', 'type', 'sleep_phase_5_min', 'timezone',
'bedtime_start_delta', 'bedtime_end_delta', 'midpoint_at_delta',
'heart_rate_5_min', 'hrv_5_min', 'readiness_temperature_deviation',
'readiness_temperature_trend_deviation'],
dtype='object')
We'll be removing any features that don't seem relevant to our sleep score prediction model. For now we will not touch most of the potentially "overlapping" features that might introduce predictor collinearity issues, but will remove some obvious ones and address the remainder later.
# Columns to drop
# sleep_df
drop_col_sleep = ['efficiency', 'period', 'segment_state',
'sleep_midpoint', 'sleep_phase_5_min', 'movement_30_sec',
'timezone', 'bedtime_end_delta', 'midpoint_at_delta',
'heart_rate_5_min', 'hrv_5_min'] # timezone? -- might need during EDA
sleep_df.drop(drop_col_sleep, axis = 1, inplace = True)
# biometric_df
drop_col_bio = ['average_met_minutes', 'equivalent_walking_distance',
'high_activity_met_minutes', 'inactivity_alerts',
'low_activity_met_minutes', 'medium_activity_met_minutes',
'sedentary_met_minutes', 'target_calories',
'target_meters', 'class_5_min',
'met_1_min', 'ring_met_1_min',
'temperature_trend_deviation']
biometric_df.drop(drop_col_bio, axis = 1, inplace = True)
Features like efficiency were removed for being too similar to other remaining features: efficiency is just the ratio of total_sleep_duration to time_in_bed, which would cause collinearity issues in our models later.
average_met_minutes, target_calories, target_meters, etc. are daily goals used by Oura to calculate other metrics.
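To see why efficiency is redundant, note that it can be reconstructed (up to rounding) from two features we keep. A toy sketch with hypothetical durations in seconds, not actual Oura values:

```python
import pandas as pd

# Hypothetical durations in seconds, mimicking the two columns we keep.
df = pd.DataFrame({
    'time_in_bed': [28800, 25200, 30600],
    'total_sleep_duration': [26100, 21600, 27540],
})
# An efficiency-style percentage is a deterministic function of the two columns,
# so keeping all three would introduce perfect collinearity.
df['efficiency_pct'] = (df['total_sleep_duration'] / df['time_in_bed'] * 100).round()
print(df['efficiency_pct'].tolist())  # [91.0, 86.0, 90.0]
```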
We'll also take a look at the remaining features' data types and see if any data munging is necessary, formatting the features as floats, ints, dates, or dummy/indicator encodings where appropriate. At the very least, we know there are a few date columns that would be beneficial to turn into pandas datetime objects.
# Reformat date columns
date_col_sleep = ['bedtime_end', 'bedtime_start', 'day']
# All times reformatted to UTC
sleep_df[date_col_sleep] = sleep_df[date_col_sleep].apply(pd.to_datetime, utc = True, errors = 'coerce')
biometric_df['day'] = biometric_df['day'].apply(pd.to_datetime, utc = True, errors = 'coerce')
daily_sleep_score_df['day'] = daily_sleep_score_df['day'].apply(pd.to_datetime, utc = True, errors = 'coerce')
Display all columns that aren't float or int.
print('biometric_df:')
print(biometric_df.select_dtypes(exclude = ['int64', 'float64']).dtypes)
print('\nsleep_df:')
print(sleep_df.select_dtypes(exclude = ['int64', 'float64']).dtypes)
biometric_df: day datetime64[ns, UTC] dtype: object sleep_df: bedtime_end datetime64[ns, UTC] bedtime_start datetime64[ns, UTC] day datetime64[ns, UTC] type object dtype: object
Everything above looks good.
We can see that ['type'] is an object, which likely means it's a string. Let's see what values it can take and whether it will be useful to keep.
sleep_df['type'].unique()
array(['long_sleep', nan, 'late_nap', 'rest', 'sleep'], dtype=object)
As mentioned before, sleep_df contains multiple entries per day, each being either a main sleep period, an interrupted period of sleep during the night, or an afternoon nap. It looks like Oura uses the ['type'] feature to indicate what type of sleep each entry belongs to. We can use this information, along with the time data in the next section, to identify different periods of sleep and to do a little feature engineering.
A reasonable solution seems to be to total all the sleep categories that occurred per day, with sleep interruptions at night and any naps counting towards that total. However, we will lose the nap information when we total the sleep times, and it might also be beneficial to identify days on which naps occur using a binary indicator (0 = No Nap, 1 = Yes Nap), since a nap could potentially increase the sleep score. We will make this new feature from the sleep information before consolidation.
We will identify naps as any entry whose ['type'] is not 'long_sleep' and that occurs between 9:00 AM and 6:00 PM local time (the filter below uses the UTC equivalents).
# Initialize a time frame we can consider a nap/rest period (UTC format).
nap_upper = pd.to_datetime('23:59:00').time()
nap_lower = pd.to_datetime('15:00:00').time()
# Condition on the start, end time, and total sleep duration to filter out false-positives, remove any long_sleep types.
nap_bool = ((sleep_df['bedtime_start'].dt.time >= nap_lower) &
(sleep_df['bedtime_end'].dt.time <= nap_upper) &
(sleep_df['total_sleep_duration'] > 600) &
(sleep_df['type'] != 'long_sleep'))
print('Naps identified:\n')
display(sleep_df[nap_bool])
# Insert a nap_today column and binary yes/no for each day.
# Initialize column with zeros.
sleep_df['nap_today'] = 0
# Use boolean array to identify nap days and iterate through to change nap_today to 1.
# TODO: This is not the pandas way but is fine with this amount of data. Use a better method later.
# Something like, sleep_df['nap_today'].where(~nap_bool, 1, inplace = True) does not work in this situation..
# Need each entry for that day to = 1 not just the nap itself.
nap_days = sleep_df[nap_bool]['day']
for day in nap_days:
sleep_df.loc[sleep_df['day'] == day, 'nap_today'] = 1
Naps identified:
| average_breath | average_heart_rate | average_hrv | awake_time | bedtime_end | bedtime_start | day | deep_sleep_duration | latency | light_sleep_duration | lowest_heart_rate | rem_sleep_duration | restless_periods | time_in_bed | total_sleep_duration | type | bedtime_start_delta | readiness_temperature_deviation | readiness_temperature_trend_deviation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31 | 13.500 | 74.47 | 34.0 | 6270.0 | 2023-02-20 21:16:13+00:00 | 2023-02-20 19:05:13+00:00 | 2023-02-20 00:00:00+00:00 | 420.0 | 570.0 | 840.0 | 67.0 | 330.0 | 25.0 | 7860 | 1590.0 | NaN | 47113 | NaN | NaN |
| 34 | 13.875 | 70.14 | 58.0 | 3540.0 | 2023-02-21 23:27:03+00:00 | 2023-02-21 21:59:03+00:00 | 2023-02-21 00:00:00+00:00 | 120.0 | 3600.0 | 1140.0 | 60.0 | 480.0 | 23.0 | 5280 | 1740.0 | NaN | 57543 | NaN | NaN |
| 44 | 14.750 | 90.50 | 25.0 | 3690.0 | 2023-02-27 01:11:23+00:00 | 2023-02-26 23:54:23+00:00 | 2023-02-27 00:00:00+00:00 | 300.0 | 2760.0 | 540.0 | 90.0 | 90.0 | 24.0 | 4620 | 930.0 | NaN | 60863 | NaN | NaN |
| 53 | 14.750 | NaN | NaN | 1140.0 | 2023-03-01 23:27:00+00:00 | 2023-03-01 22:51:00+00:00 | 2023-03-01 00:00:00+00:00 | 570.0 | 750.0 | 420.0 | NaN | 30.0 | 11.0 | 2160 | 1020.0 | NaN | 60660 | NaN | NaN |
| 63 | 14.625 | 71.67 | NaN | 930.0 | 2023-03-06 22:03:32+00:00 | 2023-03-06 21:30:32+00:00 | 2023-03-06 00:00:00+00:00 | 630.0 | 390.0 | 420.0 | 70.0 | 0.0 | 31.0 | 1980 | 1050.0 | NaN | 55832 | NaN | NaN |
| 88 | 13.750 | 65.15 | 75.0 | 1530.0 | 2023-03-17 21:33:33+00:00 | 2023-03-17 20:17:33+00:00 | 2023-03-17 00:00:00+00:00 | 90.0 | 450.0 | 2940.0 | 62.0 | 0.0 | 88.0 | 4560 | 3030.0 | NaN | 55053 | NaN | NaN |
| 89 | 13.750 | 68.33 | 63.0 | 2070.0 | 2023-03-17 23:52:33+00:00 | 2023-03-17 22:48:33+00:00 | 2023-03-18 00:00:00+00:00 | 540.0 | 1410.0 | 780.0 | 63.0 | 450.0 | 12.0 | 3840 | 1770.0 | late_nap | 64113 | NaN | NaN |
| 116 | 13.625 | 70.40 | 70.0 | 3030.0 | 2023-03-30 20:27:01+00:00 | 2023-03-30 19:26:01+00:00 | 2023-03-30 00:00:00+00:00 | 180.0 | 690.0 | 450.0 | 68.0 | 0.0 | 10.0 | 3660 | 630.0 | NaN | 51961 | NaN | NaN |
| 152 | 14.125 | 61.20 | 107.0 | 750.0 | 2023-04-23 21:10:49+00:00 | 2023-04-23 20:29:49+00:00 | 2023-04-23 00:00:00+00:00 | 1260.0 | 330.0 | 450.0 | 56.0 | 0.0 | 15.0 | 2460 | 1710.0 | sleep | 55789 | NaN | NaN |
| 180 | 14.875 | 71.20 | 80.0 | 1080.0 | 2023-05-10 18:45:42+00:00 | 2023-05-10 17:44:42+00:00 | 2023-05-10 00:00:00+00:00 | 810.0 | 570.0 | 1770.0 | 67.0 | 0.0 | 29.0 | 3660 | 2580.0 | sleep | 45882 | NaN | NaN |
| 197 | 13.500 | 66.75 | 117.0 | 1080.0 | 2023-05-16 23:12:57+00:00 | 2023-05-16 22:33:57+00:00 | 2023-05-17 00:00:00+00:00 | 60.0 | 1020.0 | 1200.0 | 63.0 | 0.0 | 19.0 | 2340 | 1260.0 | late_nap | 63237 | NaN | NaN |
| 210 | 13.500 | 73.29 | 81.0 | 1680.0 | 2023-05-20 23:05:22+00:00 | 2023-05-20 22:17:22+00:00 | 2023-05-21 00:00:00+00:00 | 0.0 | 1770.0 | 1200.0 | 69.0 | 0.0 | 17.0 | 2880 | 1200.0 | late_nap | 62242 | NaN | NaN |
| 233 | 14.500 | 70.20 | 64.0 | 1440.0 | 2023-06-03 23:11:00+00:00 | 2023-06-03 22:14:00+00:00 | 2023-06-04 00:00:00+00:00 | 1590.0 | 1320.0 | 390.0 | 64.0 | 0.0 | 13.0 | 3420 | 1980.0 | NaN | -20760 | NaN | NaN |
| 251 | 14.250 | 59.73 | 123.0 | 1650.0 | 2023-06-11 21:39:31+00:00 | 2023-06-11 20:30:31+00:00 | 2023-06-11 00:00:00+00:00 | 900.0 | 600.0 | 1590.0 | 53.0 | 0.0 | 30.0 | 4140 | 2490.0 | NaN | 59431 | NaN | NaN |
| 258 | 14.125 | 75.67 | 73.0 | 570.0 | 2023-06-17 21:25:32+00:00 | 2023-06-17 20:52:32+00:00 | 2023-06-17 00:00:00+00:00 | 150.0 | 150.0 | 1260.0 | 72.0 | 0.0 | 18.0 | 1980 | 1410.0 | sleep | 60752 | NaN | NaN |
| 261 | 14.625 | 56.83 | 123.0 | 3090.0 | 2023-06-18 20:52:32+00:00 | 2023-06-18 19:36:32+00:00 | 2023-06-18 00:00:00+00:00 | 150.0 | 1770.0 | 1320.0 | 52.0 | 0.0 | 15.0 | 4560 | 1470.0 | NaN | 56192 | NaN | -0.13 |
| 264 | 15.375 | NaN | NaN | 1980.0 | 2023-06-19 18:20:32+00:00 | 2023-06-19 17:33:02+00:00 | 2023-06-19 00:00:00+00:00 | 120.0 | 1920.0 | 750.0 | NaN | 0.0 | 29.0 | 2850 | 870.0 | NaN | 48782 | NaN | -0.14 |
| 298 | 16.875 | NaN | NaN | 1141.0 | 2023-07-09 20:44:02+00:00 | 2023-07-09 20:07:01+00:00 | 2023-07-09 00:00:00+00:00 | 0.0 | 1140.0 | 1080.0 | NaN | 0.0 | 31.0 | 2221 | 1080.0 | NaN | 58021 | NaN | -0.03 |
| 322 | 13.375 | NaN | NaN | 1050.0 | 2023-07-22 19:31:02+00:00 | 2023-07-22 19:02:02+00:00 | 2023-07-22 00:00:00+00:00 | 0.0 | 0.0 | 690.0 | NaN | 0.0 | 23.0 | 1740 | 690.0 | NaN | 54122 | NaN | 0.11 |
| 323 | 13.375 | 73.73 | 69.0 | 2070.0 | 2023-07-22 21:10:02+00:00 | 2023-07-22 19:36:02+00:00 | 2023-07-22 00:00:00+00:00 | 1170.0 | 1140.0 | 1800.0 | 70.0 | 600.0 | 72.0 | 5640 | 3570.0 | NaN | 56162 | NaN | 0.11 |
| 325 | 14.250 | 56.38 | 103.0 | 2070.0 | 2023-07-23 21:47:02+00:00 | 2023-07-23 20:21:32+00:00 | 2023-07-23 00:00:00+00:00 | 1200.0 | 1080.0 | 1860.0 | 53.0 | 0.0 | 19.0 | 5130 | 3060.0 | NaN | 58892 | NaN | 0.26 |
| 347 | 14.500 | 64.60 | 96.0 | 1260.0 | 2023-08-03 21:04:32+00:00 | 2023-08-03 20:16:02+00:00 | 2023-08-03 00:00:00+00:00 | 450.0 | 990.0 | 1200.0 | 62.0 | 0.0 | 3.0 | 2910 | 1650.0 | NaN | 58562 | NaN | -0.14 |
| 349 | 14.750 | 72.75 | 79.0 | 1260.0 | 2023-08-04 22:19:30+00:00 | 2023-08-04 21:28:30+00:00 | 2023-08-05 00:00:00+00:00 | 300.0 | 600.0 | 1500.0 | 66.0 | 0.0 | 32.0 | 3060 | 1800.0 | late_nap | 62910 | NaN | -0.02 |
| 353 | 14.125 | NaN | NaN | 691.0 | 2023-08-06 20:23:34+00:00 | 2023-08-06 19:54:33+00:00 | 2023-08-06 00:00:00+00:00 | 0.0 | 390.0 | 1050.0 | NaN | 0.0 | 10.0 | 1741 | 1050.0 | NaN | 57273 | NaN | 0.10 |
| 377 | 16.250 | NaN | NaN | 571.0 | 2023-08-20 22:38:00+00:00 | 2023-08-20 22:10:29+00:00 | 2023-08-20 00:00:00+00:00 | 90.0 | 420.0 | 960.0 | NaN | 30.0 | 0.0 | 1651 | 1080.0 | NaN | 61829 | NaN | 0.05 |
| 408 | 15.750 | NaN | NaN | 1202.0 | 2023-09-04 23:57:04+00:00 | 2023-09-04 23:25:32+00:00 | 2023-09-05 00:00:00+00:00 | 60.0 | 0.0 | 630.0 | NaN | 0.0 | 23.0 | 1892 | 690.0 | NaN | -20068 | NaN | 0.16 |
| 452 | 17.875 | 89.67 | 21.0 | 1020.0 | 2023-09-25 16:37:57+00:00 | 2023-09-25 15:56:27+00:00 | 2023-09-26 00:00:00+00:00 | 210.0 | 690.0 | 1260.0 | 88.0 | 0.0 | 11.0 | 2490 | 1470.0 | NaN | 64587 | NaN | 0.06 |
| 455 | 15.375 | 79.80 | NaN | 1830.0 | 2023-09-27 16:11:57+00:00 | 2023-09-27 15:21:27+00:00 | 2023-09-28 00:00:00+00:00 | 30.0 | 0.0 | 1170.0 | 79.0 | 0.0 | 33.0 | 3030 | 1200.0 | NaN | 62487 | NaN | 0.18 |
| 463 | 16.375 | NaN | NaN | 570.0 | 2023-09-30 21:22:29+00:00 | 2023-09-30 20:46:29+00:00 | 2023-10-01 00:00:00+00:00 | 210.0 | 300.0 | 1380.0 | NaN | 0.0 | 8.0 | 2160 | 1590.0 | NaN | -8011 | NaN | 0.14 |
| 508 | 13.375 | 61.17 | 101.0 | 420.0 | 2023-10-22 23:16:01+00:00 | 2023-10-22 22:30:01+00:00 | 2023-10-23 00:00:00+00:00 | 270.0 | 720.0 | 2010.0 | 58.0 | 60.0 | 22.0 | 2760 | 2340.0 | late_nap | 63001 | NaN | -0.28 |
| 522 | 13.875 | 62.00 | 92.0 | 2670.0 | 2023-10-27 21:00:04+00:00 | 2023-10-27 19:52:34+00:00 | 2023-10-27 00:00:00+00:00 | 60.0 | 2130.0 | 1230.0 | 59.0 | 90.0 | 16.0 | 4050 | 1380.0 | NaN | 53554 | NaN | 0.03 |
| 540 | 15.375 | 68.50 | 66.0 | 1380.0 | 2023-11-05 22:15:01+00:00 | 2023-11-05 21:33:01+00:00 | 2023-11-05 00:00:00+00:00 | 360.0 | 1380.0 | 780.0 | 66.0 | 0.0 | 31.0 | 2520 | 1140.0 | sleep | 55981 | NaN | -0.03 |
| 543 | 14.125 | NaN | NaN | 841.0 | 2023-11-06 23:03:30+00:00 | 2023-11-06 22:33:59+00:00 | 2023-11-06 00:00:00+00:00 | 30.0 | 810.0 | 900.0 | NaN | 0.0 | 7.0 | 1771 | 930.0 | NaN | 59639 | NaN | NaN |
| 590 | 15.500 | 90.60 | NaN | 1260.0 | 2023-11-30 20:52:33+00:00 | 2023-11-30 20:20:03+00:00 | 2023-11-30 00:00:00+00:00 | 0.0 | 0.0 | 690.0 | 86.0 | 0.0 | 23.0 | 1950 | 690.0 | NaN | 51603 | NaN | NaN |
| 594 | 14.375 | 73.23 | 51.0 | 6150.0 | 2023-12-02 23:08:02+00:00 | 2023-12-02 20:21:02+00:00 | 2023-12-02 00:00:00+00:00 | 1230.0 | 4860.0 | 1800.0 | 66.0 | 840.0 | 0.0 | 10020 | 3870.0 | NaN | 51662 | 0.47 | 0.53 |
| 617 | 14.750 | 74.90 | 45.0 | 1830.0 | 2023-12-16 22:11:30+00:00 | 2023-12-16 21:04:30+00:00 | 2023-12-16 00:00:00+00:00 | 150.0 | 1260.0 | 2040.0 | 66.0 | 0.0 | 38.0 | 4020 | 2190.0 | NaN | 54270 | NaN | 0.10 |
| 654 | 14.500 | 57.82 | 123.0 | 1500.0 | 2024-01-02 23:22:30+00:00 | 2024-01-02 21:40:30+00:00 | 2024-01-02 00:00:00+00:00 | 720.0 | 990.0 | 2910.0 | 54.0 | 990.0 | 16.0 | 6120 | 4620.0 | sleep | 56430 | NaN | 0.27 |
| 722 | 14.625 | 63.80 | 82.0 | 2190.0 | 2024-02-03 21:12:02+00:00 | 2024-02-03 20:07:02+00:00 | 2024-02-03 00:00:00+00:00 | 60.0 | 1470.0 | 1650.0 | 56.0 | 0.0 | 27.0 | 3900 | 1710.0 | sleep | 50822 | NaN | -0.54 |
| 725 | 14.000 | NaN | NaN | 3060.0 | 2024-02-04 22:54:30+00:00 | 2024-02-04 21:45:00+00:00 | 2024-02-04 00:00:00+00:00 | 0.0 | 2880.0 | 1110.0 | NaN | 0.0 | 0.0 | 4170 | 1110.0 | NaN | 56700 | NaN | -0.63 |
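As the TODO in the cell above notes, the day-matching loop can be replaced with a vectorized pandas idiom. One possible approach, sketched here with toy stand-ins for sleep_df and nap_bool rather than the real data:

```python
import pandas as pd

# Toy stand-in: three sleep entries across two days, with one nap flagged.
sleep_df = pd.DataFrame({'day': ['2023-02-20', '2023-02-20', '2023-02-21']})
nap_bool = pd.Series([False, True, False])

# Flag every entry belonging to a day on which at least one nap occurred,
# not just the nap entry itself.
nap_days = sleep_df.loc[nap_bool, 'day']
sleep_df['nap_today'] = sleep_df['day'].isin(nap_days).astype(int)
print(sleep_df['nap_today'].tolist())  # [1, 1, 0]
```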
Continuing to set up our feature dataframe: the biometric data collected on a given day will affect the next night's sleep. Currently, the sleep score is recorded on the day you wake, meaning each date's score references the previous night's sleep. This is important to note because the biometric features we're left with (e.g., calories burned, steps, heart rate, etc. in biometric_df) are collected on the day for which a sleep score has already been recorded.
Logically, this data, mostly indicating activity levels, is more relevant to the upcoming night's sleep. Therefore, since we're organizing metrics per day, we need to shift this date one day forward to align with our model's target/response variable (sleep score). This will align the dataframes for when we merge them on the ['day'] column later.
Example:
| Score | Day (sleep_df) | Bedtime Start | Bedtime End | / | / | Day (biometric_df) | Steps | Active Calories | |
|---|---|---|---|---|---|---|---|---|---|
| ... | ... | ... | ... | / | / | ... | ... | ... | |
| 81 | 01/20/2024 | 1/19/2024 22:30 | 1/20/2024 07:00 | / | / | 01/20/2024 | 13000 | 800 | |
| 65 | 01/21/2024 | 1/20/2024 23:45 | 1/21/2024 05:20 | / | / | 01/21/2024 | 21000 | 1900 | |
| ... | ... | ... | ... | / | / | ... | ... | ... |
| Score | Day (sleep_df) | Bedtime Start | Bedtime End | / | / | Steps | Active Calories |
|---|---|---|---|---|---|---|---|
| ... | ... | ... | ... | / | / | ... ⇩ | ... ⇩ |
| 81 | 01/20/2024 | 1/19/2024 22:30 | 1/20/2024 07:00 | / | / | ... ⇩ | ... ⇩ |
| 65 | 01/21/2024 | 1/20/2024 23:45 | 1/21/2024 05:20 | / | / | 13000 ⇩ | 800 ⇩ |
| ... | ... | ... | ... | / | / | 21000 ⇩ | 1900 ⇩ |
# Shift day one day into the future to align with relevant sleep score day.
biometric_df['day'] = biometric_df['day'] + pd.DateOffset(days = 1)
# Remove last day in biometric dataset since it won't have a sleep score with day shifting.
bio_max_idx = biometric_df.loc[biometric_df['day'] == biometric_df['day'].max()].index
biometric_df.drop(index = bio_max_idx, axis = 0, inplace = True)
# Verify each dataframe starts and ends on the same day.
print('Earliest Dates in each DF:')
print(biometric_df['day'].min(), '<-- biometric_df')
print(sleep_df['day'].min(), '<-- sleep_df')
print(daily_sleep_score_df['day'].min(), '<-- daily_sleep_score_df')
print('\nLatest Dates in each DF:')
print(biometric_df['day'].max(), '<-- biometric_df')
print(sleep_df['day'].max(), '<-- sleep_df')
print(daily_sleep_score_df['day'].max(), '<-- daily_sleep_score_df')
Earliest Dates in each DF:
2023-02-04 00:00:00+00:00 <-- biometric_df
2023-02-04 00:00:00+00:00 <-- sleep_df
2023-02-04 00:00:00+00:00 <-- daily_sleep_score_df

Latest Dates in each DF:
2024-02-06 00:00:00+00:00 <-- biometric_df
2024-02-06 00:00:00+00:00 <-- sleep_df
2024-02-06 00:00:00+00:00 <-- daily_sleep_score_df
Now we should identify any days on which a sleep score wasn't recorded or wasn't accurate, for example because not enough sleep was recorded, or because the ring wasn't worn during the night but still recorded biometric data for the day. We'll again use the long_sleep value in ['type'] to identify days where sleep was detected and use it as a mask to find days where no sleep was recorded.
# Identify days where there wasn't a long period of sleep
sleep_df_long = sleep_df.loc[sleep_df['type'] == 'long_sleep']
# Use merge() to find the differences in dataframes indicating days that didn't record a long sleep.
# Requires a method where the two df indexes don't match.
# Credit to: https://stackoverflow.com/questions/48647534/find-difference-between-two-data-frames
no_sleep = pd.DataFrame(sleep_df['day']).merge(sleep_df_long['day'], indicator = True, how='left').loc[lambda x : x['_merge']!='both']
print('Entries in which no long periods of sleep were recorded that day:')
display(no_sleep)
# We can remove these days with no long_sleep from the main dataframes as they won't have sleep scores anyways.
# Sleep DF
sleep_df.drop(no_sleep.index, axis = 0, inplace = True)
# Biometric DF
# We can use our previous work and just remove the days that the two DFs don't share.
no_sleep_bio_days = pd.DataFrame(biometric_df['day']).merge(sleep_df['day'], indicator = True, how='left').loc[lambda x : x['_merge']!='both']['day']
no_sleep_bio_index = biometric_df[biometric_df['day'].isin(no_sleep_bio_days)].index
biometric_df.drop(no_sleep_bio_index, axis = 0, inplace = True)
# Daily Sleep Score DF
no_sleep_score_index = pd.DataFrame(daily_sleep_score_df['day']).merge(biometric_df['day'], indicator = True, how='left').loc[lambda x : x['_merge']!='both']['day'].index
daily_sleep_score_df.drop(index = no_sleep_score_index, axis = 0, inplace = True)
Entries in which no long periods of sleep were recorded that day:
| day | _merge | |
|---|---|---|
| 103 | 2023-03-24 00:00:00+00:00 | left_only |
| 140 | 2023-04-15 00:00:00+00:00 | left_only |
| 161 | 2023-05-01 00:00:00+00:00 | left_only |
| 162 | 2023-05-01 00:00:00+00:00 | left_only |
| 163 | 2023-05-02 00:00:00+00:00 | left_only |
| 173 | 2023-05-08 00:00:00+00:00 | left_only |
| 223 | 2023-05-28 00:00:00+00:00 | left_only |
| 224 | 2023-05-28 00:00:00+00:00 | left_only |
| 326 | 2023-07-24 00:00:00+00:00 | left_only |
| 402 | 2023-08-31 00:00:00+00:00 | left_only |
| 441 | 2023-09-21 00:00:00+00:00 | left_only |
| 442 | 2023-09-21 00:00:00+00:00 | left_only |
| 536 | 2023-11-05 00:00:00+00:00 | left_only |
| 537 | 2023-11-05 00:00:00+00:00 | left_only |
| 538 | 2023-11-05 00:00:00+00:00 | left_only |
| 539 | 2023-11-05 00:00:00+00:00 | left_only |
| 540 | 2023-11-05 00:00:00+00:00 | left_only |
| 588 | 2023-11-29 00:00:00+00:00 | left_only |
As we've seen, our target (response) variable, sleep score, comes in the form of discrete integers. This value is produced by Oura's own models, and the exact process behind it has been obfuscated from us. There's an argument to be made about whether to treat this as a regression task or a classification task, and whether there's ever merit in casting a regression task into the classification space, since transforming a target variable almost certainly introduces bias and causes information loss. We will discuss that further in the conclusion section.
For now, since this project is just an exercise and since our dataset has relatively few samples, it might be beneficial to explore transforming the sleep score into categorical variable bins. We will keep the discrete sleep score column for use in our regression models, but create another few columns using different techniques that indicates a bin that score falls into for use in logistic regression/classification models.
Enough theory, let's discuss how we can accomplish binning any variable. We have a few different methods we can use to categorize sleep scores:
- Even-width binning: fixed-width intervals across the score range.
- Quantile binning: interval edges chosen so each bin holds roughly the same number of samples.
- Custom binning: interval edges derived from summary statistics and domain knowledge.
Here we will implement a few versions of these binning techniques and analyze which makes the most sense for our sleep score data.
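As a quick preview, here is a minimal sketch of the three approaches on synthetic scores (the data, bin counts, and labels below are illustrative only, not this project's final choices):

```python
import numpy as np
import pandas as pd

# Synthetic scores standing in for the real sleep scores (floats avoid
# duplicate quantile edges on a small discrete sample).
rng = np.random.default_rng(0)
scores = pd.Series(rng.uniform(40, 95, size=200))

# 1. Even-width bins: fixed interval width; counts follow the distribution.
even_width = pd.cut(scores, bins=5)

# 2. Quantile bins: edges chosen so each bin holds ~the same number of samples.
quantile = pd.qcut(scores, q=5, labels=['Worst', 'Bad', 'Avg', 'Good', 'Best'])

# 3. Custom bins: edges derived from summary statistics (here median ± std).
m, sd = scores.median(), scores.std()
custom = pd.cut(scores, bins=[0, m - sd, m, m + sd, 100],
                labels=['Worst', 'Bad', 'Good', 'Best'])

print(even_width.value_counts().sort_index())
print(quantile.value_counts())
print(custom.value_counts())
```

The trade-off previewed here is the one explored below: even-width bins respect the score scale but inherit the data's imbalance, while quantile bins force balance at the cost of uneven interval widths.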
Below, an even-width binning method has been implemented with 10 bins, each with a width of 10. An offset of 0.5 was used so the bin edges fall between integers, avoiding any ambiguity about which bin an integer score belongs to. The choice of these parameters is fairly arbitrary, but it seems justifiable that there is a meaningful difference between a sleep score of, say, 79 and one of 88.
# Instead of having 0-100 potential label predictions when using a classification model,
# we should create a categorical 'binned' label column.
# Create bins. We can use .5 to easily decide left edge or right edge issues.
bin_number = 10
bin_interval_even = np.linspace(0.5, 100.5, bin_number + 1, dtype = float)
# Build a "lo to hi" label for each consecutive pair of bin edges.
bin_labels = [f"{lo} to {hi}" for lo, hi in zip(bin_interval_even[:-1], bin_interval_even[1:])]
# np.digitize() can map the sleep scores to the correct bins
bin_idx = np.digitize(daily_sleep_score_df['score'], bin_interval_even, right = False)
# Create new column with the new bin labels
bin_map_even = [bin_labels[value - 1] for value in bin_idx]
daily_sleep_score_df['score_bin_even'] = bin_map_even
print(bin_labels)
display(daily_sleep_score_df['score_bin_even'].value_counts())
daily_sleep_score_df['score'].plot(kind='kde')
plt.vlines(bin_interval_even, ymin = 0, ymax = 0.055, color = 'purple', alpha = 0.5)
for idx, label in enumerate(bin_labels):
plt.text(x = ((bin_interval_even[idx] + bin_interval_even[idx + 1]) / 2) - 1.5,
y = (plt.ylim()[0] + plt.ylim()[1]) / 2,
s = label,
rotation = 'vertical',
alpha = 0.5)
plt.xlabel('Sleep Score')
['0.5 to 10.5', '10.5 to 20.5', '20.5 to 30.5', '30.5 to 40.5', '40.5 to 50.5', '50.5 to 60.5', '60.5 to 70.5', '70.5 to 80.5', '80.5 to 90.5', '90.5 to 100.5']
score_bin_even
70.5 to 80.5     165
80.5 to 90.5     101
60.5 to 70.5      68
50.5 to 60.5       9
40.5 to 50.5       5
90.5 to 100.5      4
30.5 to 40.5       1
Name: count, dtype: int64
Text(0.5, 0, 'Sleep Score')
You'll notice that, once overlaid on the distribution of the sleep score data, there appear to be many wasted bins that capture no data. Logically you might increase the bin width, reducing the total number of bins, but you quickly run into a different problem: the majority of the data then lands in only a few classes, causing severe class imbalance (arguably already a problem with the parameters chosen here). Conversely, decreasing the bin width makes the class boundaries harder to interpret and amplifies the effect of outliers.
The distribution of scores is split into 5 quantiles (quintiles), meaning each bin captures roughly one fifth of the samples. The bins will be labeled: ['Worst', 'Bad', 'Avg', 'Good', 'Best'].
Note: this and the following method denote an 'Avg' bin, but it is actually centered on the median; the two are close enough here that they effectively mean the same thing.
#labels_quantile = ['Worst', 'Bad', 'Below Avg','Above Avg', 'Good', 'Best']
labels_quantile = ['Worst', 'Bad', 'Avg', 'Good', 'Best']
bin_map_quantile, bin_interval_quantile = pd.qcut(daily_sleep_score_df['score'], 5, labels = labels_quantile, retbins = True)
daily_sleep_score_df['score_bin_quantile'] = bin_map_quantile
print('Quantile Bins:', bin_interval_quantile)
print('\nDistribution of Bins:', bin_map_quantile.value_counts())
daily_sleep_score_df['score'].plot(kind = 'kde')
plt.vlines(bin_interval_quantile, ymin = 0, ymax = 0.055, color = 'purple', alpha = 0.5)
for idx, label in enumerate(labels_quantile):
plt.text(x = ((bin_interval_quantile[idx] + bin_interval_quantile[idx + 1]) / 2) - 1.5,
y = (plt.ylim()[0] + plt.ylim()[1]) / 2,
s = f'{label} - {bin_map_quantile.value_counts()[label]}',
rotation = 'vertical',
alpha = 0.5)
plt.xlabel('Sleep Score')
Quantile Bins: [36. 69. 75. 78. 82. 94.]

Distribution of Bins:
score
Bad      84
Worst    73
Best     68
Good     66
Avg      62
Name: count, dtype: int64
Text(0.5, 0, 'Sleep Score')
With very little parameter tweaking, the bin intervals overlaid on the distribution fit the data much better than the even-width method, producing bins with reasonable balance that match the data distribution. Another positive attribute of this method is that the first and last interval edges sit at the minimum and maximum values of the data. It also limits how much outliers affect our classifications: the bins naturally widen toward the tails of the distribution, where there are fewer score values.
This method isn't too dissimilar from the previous quantile implementation, but uses only the median and standard deviation to construct each interval. Since this data is my own, the choices were based on my own perception of my sleep quality (i.e., I believe my sleep generally skews toward poor sleep, so the bins and the number of samples in each should reflect that perception, which is backed up here by the distribution's left tail). This method also uses 6 bins, effectively splitting at the median, with the labels: ['Worst', 'Bad', 'Below Avg', 'Above Avg', 'Good', 'Best']
Note: No hard-coding was used to get these intervals, only the median and standard deviations so it will adjust when the data is updated, potentially shifting if my sleep consistently gets better.
labels_custom = ['Worst', 'Bad', 'Below Avg', 'Above Avg', 'Good', 'Best']
median_score = daily_sleep_score_df['score'].median()
std_score = daily_sleep_score_df['score'].std()
bin_interval_custom = [0,
median_score - (1.5 * std_score),
median_score - (0.5 * std_score),
median_score,
median_score + (0.5 * std_score),
median_score + (1.5 * std_score),
100]
bin_map_custom = pd.cut(daily_sleep_score_df['score'], bins = bin_interval_custom, labels = labels_custom)
daily_sleep_score_df['score_bin_custom'] = bin_map_custom
print('Custom Quantile Bins:', bin_interval_custom)
print('\nDistribution of Bins:', bin_map_custom.value_counts())
daily_sleep_score_df['score'].plot(kind='kde')
plt.vlines(bin_interval_custom, ymin = 0, ymax = 0.055, color = 'purple', alpha = 0.5)
for idx, label in enumerate(labels_custom):
plt.text(x = ((bin_interval_custom[idx] + bin_interval_custom[idx + 1]) / 2) - 1.5,
y = (plt.ylim()[0] + plt.ylim()[1]) / 2,
s = f'{label} - {bin_map_custom.value_counts()[label]}',
rotation = 'vertical',
alpha = 0.5)
plt.xlabel('Sleep Score')
Custom Quantile Bins: [0, 64.39549639460978, 72.79849879820325, 77.0, 81.20150120179675, 89.60450360539022, 100]

Distribution of Bins:
score
Below Avg    88
Good         78
Bad          77
Above Avg    75
Worst        30
Best          5
Name: count, dtype: int64
Text(0.5, 0, 'Sleep Score')
This distribution of bins matches my personal domain knowledge of my own sleep quality; we'll see later how it performs.
Here we can take a look at a slice of the data with our new categorical versions of the target variable (y).
display(daily_sleep_score_df.tail(5))
| score | day | score_bin_even | score_bin_quantile | score_bin_custom | |
|---|---|---|---|---|---|
| 350 | 68 | 2024-02-02 00:00:00+00:00 | 60.5 to 70.5 | Worst | Bad |
| 351 | 76 | 2024-02-03 00:00:00+00:00 | 70.5 to 80.5 | Avg | Below Avg |
| 352 | 75 | 2024-02-04 00:00:00+00:00 | 70.5 to 80.5 | Bad | Below Avg |
| 353 | 83 | 2024-02-05 00:00:00+00:00 | 80.5 to 90.5 | Best | Good |
| 354 | 83 | 2024-02-06 00:00:00+00:00 | 80.5 to 90.5 | Best | Good |
Finally, now that we've extracted the features we wanted from sleep_df, we can total the multiple sleep entries per day and then combine the sleep metrics with the biometric data.
Note: some metrics are not summed in this process, and of those, only the metrics tied to the 'long_sleep' entry will be used in the final dataset. (e.g., we don't really care what the average heart rate variability (['average_hrv']) or ['latency'] is for an interrupted sleep segment during the night that only lasted 30 minutes.)
# Sum each day's sleep metrics (important for days with interrupted sleep or naps).
# Overwrites each day's entry.
sleep_df = sleep_df[['day',
'deep_sleep_duration',
'light_sleep_duration',
'rem_sleep_duration',
'restless_periods',
'awake_time',
'time_in_bed',
'total_sleep_duration']].groupby(['day']).sum().reset_index()
# Features extracted only from the long sleep to include in the final dataframe.
sleep_df_long = sleep_df_long[['day',
'average_breath',
'average_heart_rate',
'average_hrv',
'latency',
'lowest_heart_rate',
'bedtime_start_delta',
'nap_today']]
# Merge the summed features with the long_sleep-only features.
sleep_df = sleep_df.merge(sleep_df_long, on = 'day', how = 'outer')
sleep_df.head(2)
| day | deep_sleep_duration | light_sleep_duration | rem_sleep_duration | restless_periods | awake_time | time_in_bed | total_sleep_duration | average_breath | average_heart_rate | average_hrv | latency | lowest_heart_rate | bedtime_start_delta | nap_today | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-02-04 00:00:00+00:00 | 4650.0 | 15570.0 | 5820.0 | 282.0 | 4440.0 | 30480 | 26040.0 | 13.625 | 63.25 | 77.0 | 990.0 | 56.0 | -4778 | 0 |
| 1 | 2023-02-05 00:00:00+00:00 | 4590.0 | 14970.0 | 9210.0 | 240.0 | 6210.0 | 34980 | 28770.0 | 15.125 | 92.49 | 19.0 | 180.0 | 75.0 | -332 | 0 |
Now that all feature cleaning, creation, and transformations are complete, we should verify that the dataframes have matching row counts for the upcoming merge.
print('Verify all DF.shape matches:', sleep_df.shape[0] == biometric_df.shape[0] == daily_sleep_score_df.shape[0])
print(sleep_df.shape, '<-- sleep_df')
print(biometric_df.shape, '<-- biometric_df')
print(daily_sleep_score_df.shape, '<-- daily_sleep_score_df')
Verify all DF.shape matches: True
(353, 15) <-- sleep_df
(353, 13) <-- biometric_df
(353, 5) <-- daily_sleep_score_df
Now, let's create one dataframe for both the scores and features, then display them.
# Merge sleep and biometric DFs.
bio_sleep_df = sleep_df.merge(biometric_df, on = 'day', how = 'outer')
# Merge sleep score df and feature df.
bio_sleep_df = daily_sleep_score_df.merge(bio_sleep_df, on = 'day', how = 'outer')
bio_sleep_df.reset_index(drop = True, inplace = True)
display(bio_sleep_df)
print(bio_sleep_df.columns)
| score | day | score_bin_even | score_bin_quantile | score_bin_custom | deep_sleep_duration | light_sleep_duration | rem_sleep_duration | restless_periods | awake_time | ... | low_activity_time | medium_activity_time | meters_to_target | non_wear_time | resting_time | sedentary_time | steps | total_calories | temperature_deviation | spo2_percentage | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 81 | 2023-02-04 00:00:00+00:00 | 80.5 to 90.5 | Good | Above Avg | 4650.0 | 15570.0 | 5820.0 | 282.0 | 4440.0 | ... | 8460 | 120 | 6300 | 36000 | 19680 | 22140 | 2718 | 2429 | NaN | NaN |
| 1 | 81 | 2023-02-05 00:00:00+00:00 | 80.5 to 90.5 | Good | Above Avg | 4590.0 | 14970.0 | 9210.0 | 240.0 | 6210.0 | ... | 23220 | 2940 | 0 | 0 | 25740 | 34380 | 11638 | 3075 | -0.38 | 98.523 |
| 2 | 89 | 2023-02-06 00:00:00+00:00 | 80.5 to 90.5 | Best | Good | 8580.0 | 15210.0 | 7350.0 | 302.0 | 6240.0 | ... | 15000 | 2580 | 2700 | 0 | 35280 | 33480 | 9001 | 2814 | -0.04 | 97.181 |
| 3 | 78 | 2023-02-07 00:00:00+00:00 | 70.5 to 80.5 | Avg | Above Avg | 4500.0 | 17700.0 | 6030.0 | 284.0 | 5010.0 | ... | 13440 | 2160 | 5300 | 0 | 31080 | 39600 | 7544 | 2748 | -0.22 | 99.177 |
| 4 | 80 | 2023-02-08 00:00:00+00:00 | 70.5 to 80.5 | Good | Above Avg | 4920.0 | 15630.0 | 6420.0 | 245.0 | 6210.0 | ... | 16560 | 3180 | 2300 | 4800 | 31020 | 30840 | 8668 | 2852 | -0.25 | 98.634 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 348 | 68 | 2024-02-02 00:00:00+00:00 | 60.5 to 70.5 | Worst | Bad | 4890.0 | 13800.0 | 4890.0 | 197.0 | 9753.0 | ... | 18000 | 5400 | -4400 | 60 | 26700 | 36120 | 13714 | 3151 | -0.22 | 96.706 |
| 349 | 76 | 2024-02-03 00:00:00+00:00 | 70.5 to 80.5 | Avg | Below Avg | 5220.0 | 17790.0 | 4770.0 | 215.0 | 9180.0 | ... | 20820 | 3660 | -3700 | 1020 | 36420 | 24360 | 13388 | 3092 | -0.71 | 99.200 |
| 350 | 75 | 2024-02-04 00:00:00+00:00 | 70.5 to 80.5 | Bad | Below Avg | 4980.0 | 13770.0 | 4950.0 | 203.0 | 9386.0 | ... | 15840 | 4260 | -2800 | 3600 | 30780 | 31860 | 13186 | 3016 | -0.60 | 98.131 |
| 351 | 83 | 2024-02-05 00:00:00+00:00 | 80.5 to 90.5 | Best | Good | 7140.0 | 12990.0 | 7320.0 | 187.0 | 6501.0 | ... | 13260 | 2520 | 400 | 360 | 31680 | 38400 | 8380 | 2773 | -0.48 | 98.595 |
| 352 | 83 | 2024-02-06 00:00:00+00:00 | 80.5 to 90.5 | Best | Good | 5700.0 | 17700.0 | 5310.0 | 200.0 | 5767.0 | ... | 17820 | 2820 | -300 | 1980 | 33480 | 30300 | 9970 | 2881 | -0.14 | 98.291 |
353 rows × 31 columns
Index(['score', 'day', 'score_bin_even', 'score_bin_quantile',
'score_bin_custom', 'deep_sleep_duration', 'light_sleep_duration',
'rem_sleep_duration', 'restless_periods', 'awake_time', 'time_in_bed',
'total_sleep_duration', 'average_breath', 'average_heart_rate',
'average_hrv', 'latency', 'lowest_heart_rate', 'bedtime_start_delta',
'nap_today', 'active_calories', 'high_activity_time',
'low_activity_time', 'medium_activity_time', 'meters_to_target',
'non_wear_time', 'resting_time', 'sedentary_time', 'steps',
'total_calories', 'temperature_deviation', 'spo2_percentage'],
dtype='object')
Lastly, we need to see what data is missing from our final dataframe and decide how to handle any missing values we find.
# Explore and deal with NaN values.
print('Columns with NaNs present:', bio_sleep_df.columns[bio_sleep_df.isna().any()].to_list())
print(f"{'':<5}Total number of NaNs: {bio_sleep_df.isna().sum(axis = 1).sum()}")
Columns with NaNs present: ['temperature_deviation', 'spo2_percentage']
Total number of NaNs: 48
48 NaN values were found, all in the temperature_deviation and spo2_percentage columns. Without assuming the importance of these features, removing 48 rows of data just for these missing values seems like a poor choice given the limited amount of data. Since the range of values in these features doesn't vary much, each NaN will be replaced with the median value of its feature so as not to alter the feature's distribution.
# A good option for handling each feature individually:
# >>> values = {"feature_1": df["feature_1"].median(), "feature_2": 0, "feature_3": df["feature_3"].mean()}
# >>> df.fillna(value = values)
# Only features with NaNs currently are temperature_deviation and spo2_percentage; using the median for now.
# TODO: Come back and tune each feature.
for column in bio_sleep_df.select_dtypes(include = ['float64', 'int64']).columns:
    bio_sleep_df[column] = bio_sleep_df[column].fillna(bio_sleep_df[column].median())
print(f'Number of NaN values remaining after adjustments: {bio_sleep_df.isna().sum(axis = 1).sum()}')
Number of NaN values remaining after adjustments: 0
First let's take a look at the high-level summary.
bio_sleep_df.describe()
| score | deep_sleep_duration | light_sleep_duration | rem_sleep_duration | restless_periods | awake_time | time_in_bed | total_sleep_duration | average_breath | average_heart_rate | ... | low_activity_time | medium_activity_time | meters_to_target | non_wear_time | resting_time | sedentary_time | steps | total_calories | temperature_deviation | spo2_percentage | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 353.000000 | 353.000000 | 353.000000 | 353.000000 | 353.000000 | 353.000000 | 353.000000 | 353.000000 | 353.000000 | 353.000000 | ... | 353.000000 | 353.000000 | 353.000000 | 353.000000 | 353.000000 | 353.000000 | 353.000000 | 353.000000 | 353.000000 | 353.000000 |
| mean | 75.597734 | 5265.042493 | 15382.776204 | 5475.042493 | 244.045326 | 6445.603399 | 32571.694051 | 26122.861190 | 13.884561 | 65.803088 | ... | 17263.682720 | 3128.838527 | -1493.484419 | 2969.745042 | 31003.342776 | 31844.362606 | 10819.963173 | 2940.813031 | 0.007450 | 97.842986 |
| std | 8.403002 | 1667.128745 | 2500.618719 | 1456.503659 | 54.461654 | 2899.808704 | 4112.171365 | 3212.363992 | 0.481592 | 7.836005 | ... | 4813.519199 | 1855.161430 | 4482.521299 | 6154.118003 | 4777.300153 | 6612.383545 | 3447.932402 | 247.672000 | 0.216197 | 0.851728 |
| min | 36.000000 | 210.000000 | 9150.000000 | 1110.000000 | 88.000000 | 1590.000000 | 19740.000000 | 14040.000000 | 12.750000 | 49.400000 | ... | 900.000000 | 120.000000 | -38700.000000 | 0.000000 | 8280.000000 | 5040.000000 | 1601.000000 | 2229.000000 | -0.710000 | 92.072000 |
| 25% | 71.000000 | 4320.000000 | 13710.000000 | 4590.000000 | 211.000000 | 4555.000000 | 30221.000000 | 24060.000000 | 13.625000 | 60.240000 | ... | 13980.000000 | 2160.000000 | -3700.000000 | 0.000000 | 28800.000000 | 27360.000000 | 8522.000000 | 2802.000000 | -0.110000 | 97.472000 |
| 50% | 77.000000 | 5130.000000 | 15300.000000 | 5490.000000 | 239.000000 | 5940.000000 | 32460.000000 | 26280.000000 | 13.875000 | 64.170000 | ... | 16860.000000 | 2820.000000 | -800.000000 | 360.000000 | 31380.000000 | 32640.000000 | 10388.000000 | 2917.000000 | 0.020000 | 98.027000 |
| 75% | 81.000000 | 6120.000000 | 17040.000000 | 6390.000000 | 280.000000 | 7560.000000 | 35345.000000 | 28080.000000 | 14.125000 | 69.730000 | ... | 19920.000000 | 3720.000000 | 1100.000000 | 3480.000000 | 33900.000000 | 36660.000000 | 12708.000000 | 3060.000000 | 0.120000 | 98.419000 |
| max | 94.000000 | 10200.000000 | 21690.000000 | 9420.000000 | 492.000000 | 27456.000000 | 51036.000000 | 35520.000000 | 15.875000 | 96.680000 | ... | 36240.000000 | 24660.000000 | 8900.000000 | 51840.000000 | 49260.000000 | 50160.000000 | 23382.000000 | 5036.000000 | 1.420000 | 99.200000 |
8 rows × 27 columns
- score seems to be slightly left-skewed, with a mean of 75.60 below the median of 77.00.
- deep_sleep_duration and rem_sleep_duration have similar data distributions.
- light_sleep_duration looks to be the most common type of sleep when compared to the others.
- steps averages ~10,819 per day, which is something to be happy about.
- 27 features is too many to plot in a full pairwise plot, so a few notable ones will be selected.
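The skew observation can be double-checked numerically; a left-skewed sample has negative skewness and a mean pulled below its median (sketched here on toy values, not the exported data):

```python
import pandas as pd

# Toy sample with a long left tail, mimicking the score distribution.
s = pd.Series([36, 60, 70, 74, 76, 77, 78, 79, 80, 82, 84])

print(s.skew())               # negative, i.e. left-skewed
print(s.mean(), s.median())   # mean pulled below the median by the tail
```

On the real data, `bio_sleep_df['score'].skew()` would give the same kind of check.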
The binned sleep scores developed earlier can be used as a convenient way to potentially visualize clusters of data with respect to sleep score. First we can plot features that are directly related to sleep metrics.
pair_columns_sleep = ['score', 'score_bin_custom', 'deep_sleep_duration',
'light_sleep_duration', 'rem_sleep_duration',
'latency', 'bedtime_start_delta']
sns.pairplot(bio_sleep_df[pair_columns_sleep], hue = 'score_bin_custom')
<seaborn.axisgrid.PairGrid at 0x289030a90>
pair_columns_sleep2 = ['score', 'score_bin_custom', 'restless_periods',
'awake_time', 'average_breath', 'average_hrv',
'nap_today']
sns.pairplot(bio_sleep_df[pair_columns_sleep2], hue = 'score_bin_custom')
<seaborn.axisgrid.PairGrid at 0x28a52b2d0>
Now let's plot a few features that are related to daily activity levels to see how they relate to sleep score and each other.
pair_columns_active = ['score', 'score_bin_custom',
'high_activity_time','low_activity_time', 'medium_activity_time',
'resting_time', 'sedentary_time']
sns.pairplot(bio_sleep_df[pair_columns_active], hue = 'score_bin_custom')
<seaborn.axisgrid.PairGrid at 0x28ee2f8d0>
The majority of the time-based data is in seconds, which is hard to interpret. Creating a seconds-to-hours converter will be helpful for EDA.
# Function to transform inputs from seconds to hours.
def sec_to_hours(seconds):
return seconds / 60 / 60
Let's see how the summary of the direct sleep metrics looks when converted to hours (note the count row below gets divided by 3,600 as well, so it can be ignored).
print('Types of Sleep in Hours:')
sec_to_hours(bio_sleep_df[['deep_sleep_duration',
'light_sleep_duration',
'rem_sleep_duration',
'total_sleep_duration']].describe())
Types of Sleep in Hours:
| deep_sleep_duration | light_sleep_duration | rem_sleep_duration | total_sleep_duration | |
|---|---|---|---|---|
| count | 0.098056 | 0.098056 | 0.098056 | 0.098056 |
| mean | 1.462512 | 4.272993 | 1.520845 | 7.256350 |
| std | 0.463091 | 0.694616 | 0.404584 | 0.892323 |
| min | 0.058333 | 2.541667 | 0.308333 | 3.900000 |
| 25% | 1.200000 | 3.808333 | 1.275000 | 6.683333 |
| 50% | 1.425000 | 4.250000 | 1.525000 | 7.300000 |
| 75% | 1.700000 | 4.733333 | 1.775000 | 7.800000 |
| max | 2.833333 | 6.025000 | 2.616667 | 9.866667 |
Much easier to interpret now.
Going further, we can visualize the total_sleep_duration distribution.
x_min = min(sec_to_hours(bio_sleep_df['total_sleep_duration']))
x_max = max(sec_to_hours(bio_sleep_df['total_sleep_duration']))
sec_to_hours(bio_sleep_df['total_sleep_duration']).plot(kind = 'kde')
plt.title('Distribution of Total Amount of Sleep per Day')
plt.xlabel('Hours of Sleep')
plt.xticks(np.arange(round(x_min - 1), round(x_max + 1), 0.5))
plt.xlim(round(x_min - 1), round(x_max + 1))
plt.grid(alpha = 0.4)
It might be interesting to see if there's any changes in sleep throughout the year.
# Use score from the merged dataframe so x and y share the same index.
ax = sns.boxplot(x = bio_sleep_df['day'].dt.month, y = bio_sleep_df['score'])
ax.set(title = 'Sleep Score by Month Boxplot', xlabel = 'month')
plt.show()
The majority of the sleep score outliers appear during the winter months, though we can also see the variance increase in August and September.
Let's try to plot another feature in the same way that probably affects sleep score.
ax = sns.boxplot(x = bio_sleep_df['day'].dt.month, y = bio_sleep_df['awake_time'])
ax.set(title = 'Awake Time by Month Boxplot', xlabel = 'month')
plt.show()
Recall, awake_time tracks the amount of time in bed but not asleep.
Interestingly, it is easy to identify a negative correlation between awake_time and sleep score: whenever awake time increases, for example in May (5), sleep score decreases during the same month.
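That visual impression can be quantified with a Pearson correlation via Series.corr; the monthly values below are illustrative, not taken from the exported data:

```python
import pandas as pd

# Illustrative monthly averages: as awake_time rises, score falls.
toy = pd.DataFrame({
    'awake_time': [5200, 7400, 6100, 8300, 5600],
    'score':      [80,   68,   75,   64,   78],
})

# Series.corr computes the Pearson correlation by default.
r = toy['awake_time'].corr(toy['score'])
print(round(r, 3))  # strongly negative
```

On the real data, `bio_sleep_df['awake_time'].corr(bio_sleep_df['score'])` would quantify the relationship directly.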
Some of the features with the biggest effect on sleep score are likely those that measure sleep duration. A plot of sleep score against the sleep durations will be useful in determining by how much.
fig, ax = plt.subplots(2,2, figsize = (6, 6), sharey = True)
ax[0, 0].scatter(bio_sleep_df['total_sleep_duration'], bio_sleep_df['score'], alpha = 0.5)
ax[0, 0].set_title('Total Sleep')
ax[0, 1].scatter(bio_sleep_df['deep_sleep_duration'], bio_sleep_df['score'], alpha = 0.5)
ax[0, 1].set_title('Deep Sleep')
ax[1, 0].scatter(bio_sleep_df['rem_sleep_duration'], bio_sleep_df['score'], alpha = 0.5)
ax[1, 0].set_title('R.E.M. Sleep')
ax[1, 1].scatter(bio_sleep_df['light_sleep_duration'], bio_sleep_df['score'], alpha = 0.5)
ax[1, 1].set_title('Light Sleep')
fig.suptitle('Sleep Score vs Types of Sleep')
fig.supxlabel('Duration (seconds)')
fig.supylabel('Sleep Score')
fig.tight_layout()
R.E.M. and deep sleep seem to have the strongest positive correlation with sleep score. Light sleep does as well, but much less obviously than the others. Total sleep has the strongest of all, but since it's just a linear combination of the others, that makes sense (more on this later).
We're done exploring the data in relation to time, so the ['day'] column is no longer necessary; it will be dropped from the dataframe from here on out.
# Remove day column in preparation for model training.
bio_sleep_df.drop(['day'], axis = 1, inplace = True)
It's time to look at each feature in relation to the others and determine whether there will be any correlation/collinearity issues using this feature set in model fitting. This will aid in the process of removing features. It will also give us a first quantitative, though limited, look at how the independent variables (predictor features) are correlated with the dependent (target) variable.
# Correlation
# Include score column but drop categorical bins to see correlation between response and predictors too.
corr_matrix = bio_sleep_df.drop(['score_bin_even', 'score_bin_quantile', 'score_bin_custom'], axis = 1).corr()
# Plot matrix.
fig, ax = plt.subplots(figsize = (16, 16))
sns.heatmap(corr_matrix, ax = ax, annot = True, fmt = '.1f')
<Axes: >
With the correlation matrix we can identify potentially important features if they have high correlation with sleep score.
- total_sleep_duration
- restless_periods
- awake_time
- latency
- bedtime_start_delta

Moving past what is correlated with score, we also need to be aware of any correlation between features, which will introduce collinearity between predictors/features into our models. In most cases, this reduces the effectiveness of a model.
- total_sleep_duration is, logically, highly correlated with the various other sleep-duration features and other features directly related to sleep. This is because total sleep is a linear combination of the others (adding deep, light, and rem sleep together equals total sleep). It seems better practice to keep the 3 specific types of sleep, as they contain more specific information.
- time_in_bed also has this issue for the same reasons.
- lowest_heart_rate, average_hrv, and a few others also show high correlation between them.

Some or many of these features will have to be removed, but in most cases we only want to remove one feature out of each pair that shows correlation.
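One way to quantify this is the variance inflation factor (VIF); below is a numpy-only sketch on synthetic durations (the column names and noise levels are assumptions, and the real check would run on bio_sleep_df's numeric columns):

```python
import numpy as np

def vif(X):
    """VIF for each column: regress it on the rest; VIF_j = 1 / (1 - R²_j)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1 - np.var(y - A @ beta) / np.var(y)
        vifs.append(1 / (1 - r2))
    return np.array(vifs)

# Synthetic durations where 'total' is a near-exact linear combination.
rng = np.random.default_rng(1)
deep  = rng.normal(5000, 1500, 300)
light = rng.normal(15000, 2500, 300)
rem   = rng.normal(5500, 1400, 300)
total = deep + light + rem + rng.normal(0, 50, 300)

X4 = np.column_stack([deep, light, rem, total])
X3 = np.column_stack([deep, light, rem])
print(vif(X4))  # huge VIFs everywhere while 'total' is included
print(vif(X3))  # ~1 once 'total' is dropped
```

A VIF near 1 means a feature is essentially independent of the rest; values above roughly 10 are a common rule of thumb for problematic collinearity.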
Before removing features, let's look at one that showed a relatively small correlation with sleep score. From the correlation matrix we can see that bedtime_start_delta has a correlation of -0.3 with score. Let's analyze this further to determine whether there is significance in keeping it for our model.
Recall that the bedtime start delta feature is calculated as the difference between the start of sleep and the 'regular' bedtime, where the baseline is taken as an average over the past two weeks, since the user's bedtime may shift over time.
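Oura's exact baseline calculation is proprietary; here is a sketch assuming a 14-day rolling mean of bedtime offsets (in seconds, relative to a reference time) as the 'regular' bedtime, using only prior days for each night's delta:

```python
import pandas as pd

# Hypothetical nightly bedtime offsets in seconds (negative = earlier than reference).
bedtime_offset = pd.Series([-4778, -332, -1200, 600, -900, -2500, 300,
                            -1500, -400, 100, -2000, -700, -300, -1100,
                            -800, 500])

# 'Regular' bedtime: mean of up to the previous 14 nights (needs at least 7),
# shifted so today's delta never includes today's own bedtime.
baseline = bedtime_offset.rolling(window=14, min_periods=7).mean().shift(1)
bedtime_start_delta = bedtime_offset - baseline

print(bedtime_start_delta.round(1).tolist())  # first 7 entries are NaN (no baseline yet)
```

The window length, minimum period, and reference point are all assumptions made for illustration, not Oura's documented method.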
bio_sleep_df[['bedtime_start_delta']].describe()
| bedtime_start_delta | |
|---|---|
| count | 353.000000 |
| mean | -1218.929178 |
| std | 3302.654300 |
| min | -14398.000000 |
| 25% | -3177.000000 |
| 50% | -1109.000000 |
| 75% | 629.000000 |
| max | 13351.000000 |
plt.scatter(bio_sleep_df['bedtime_start_delta'], bio_sleep_df['score'], alpha = 0.5)
plt.xlabel('Bedtime Start Delta (seconds)')
plt.ylabel('Sleep Score')
Text(0, 0.5, 'Sleep Score')
Again, bedtime_start_delta has a correlation score of -0.3 in relation to sleep score. Looking at this graph it is a little difficult to see this correlation, but there does seem to be a slight negative trend.
Also a negative correlation makes logical sense here as:
- A bedtime_start_delta of 0 indicates a bedtime that aligns with the 'regular' baseline bedtime.
- As bedtime_start_delta increases in the positive direction, it indicates sleep was initiated past the 'regular' bedtime. The graph shows a negative trend as this happens.
- As bedtime_start_delta decreases in the negative direction, it indicates sleep was initiated before the 'regular' bedtime. The graph shows a very slight positive trend as this happens, though it is harder to see than the previous change.

The intuition is that going to bed early would increase the quality of sleep, and thus sleep score, and conversely for a late bedtime.
Given this slightly vague correlation, it would be beneficial to test the effects of bedtime_start_delta on sleep score using a hypothesis test.
First, let's define what constitutes an early or late bedtime (corresponding to a negative or positive bedtime_start_delta, respectively).
Early Bedtime = bedtime_start_delta < 0
Late Bedtime = bedtime_start_delta > 0
We will then group the sleep scores by the two bedtime types (early or late) and perform a hypothesis test to determine whether the two samples of sleep scores have different population means.
Let's begin by checking to see what the sample mean of these two sleep score distributions are.
late_bedtime_scores = bio_sleep_df[bio_sleep_df['bedtime_start_delta'] > 0]['score']
early_bedtime_scores = bio_sleep_df[bio_sleep_df['bedtime_start_delta'] < 0]['score']
# Sample mean checks.
print(f"The sample mean of sleep scores considered to have late bedtimes: {late_bedtime_scores.mean()}")
print(f"\nThe sample mean of sleep scores considered to have early bedtimes: {early_bedtime_scores.mean()}\n")
The sample mean of sleep scores considered to have late bedtimes: 72.30630630630631
The sample mean of sleep scores considered to have early bedtimes: 77.10743801652893
Null Hypothesis - $H_0: \mu_{score | early bedtime} = \mu_{score | late bedtime}$
Alternate Hypothesis - $H_1: \mu_{score | early bedtime} \not= \mu_{score | late bedtime}$
All tests will be done with a significance level ($\alpha$): 0.05
Note: Before we can perform this two-sample t-test, we need to confirm some assumptions about the data that are required by the test.
# Set level of significance.
alpha = 0.05
fig, ax = plt.subplots(2, 2, figsize = (6, 6))
sns.histplot(late_bedtime_scores, kde = True, ax = ax[0, 0])
sm.qqplot(late_bedtime_scores, fit = 'scale', line = '45', alpha = 0.5 , ax = ax[0, 1])
ax[0,0].set_title('Late Bedtime Score: Density')
ax[0,1].set_title('Late Bedtime Score: QQ Plot')
sns.histplot(early_bedtime_scores, kde = True, ax = ax[1, 0])
sm.qqplot(early_bedtime_scores, fit = 'scale', line = '45', alpha = 0.5 , ax = ax[1, 1])
ax[1,0].set_title('Early Bedtime Score: Density')
ax[1,1].set_title('Early Bedtime Score: QQ Plot')
fig.tight_layout()
plt.show()
Looking at both the distributions and QQ plots, they appear close to normally distributed but not quite. The distributions are both left-skewed (negatively skewed) and the QQ plot tails curve off the 45 degree line.
We can confirm this quantitatively with a Shapiro-Wilk test.
Null Hypothesis - $H_0: \mathbf{X} \sim Normal(\mu, \sigma^2)$, the samples do come from a normal distribution.
Alternate Hypothesis - $H_1: \mathbf{X} \nsim Normal(\mu, \sigma^2)$, the samples DO NOT come from a normal distribution.
# TODO: If data is updated the following is not setup if sample conditions change drastically.
# Let's test if our samples display normality with the shapiro-wilk test.
# Null hypothesis H_0: samples come from a normal distribution.
print('Early Bedtime:')
t_stat_sw, p_value_sw = stats.shapiro(early_bedtime_scores)

if p_value_sw > alpha:
    print(f"Shapiro-Wilk test (Normality): \n{'':<5}with a p-value of {p_value_sw:.4e}, \n{'':<5}we fail to reject the null hypothesis: samples are from a normally distributed population.")
    print("\nSamples DO come from a normal distribution.")
else:
    print(f"Shapiro-Wilk test (Normality): \n{'':<5}with a p-value of {p_value_sw:.4e}, \n{'':<5}we successfully reject the null hypothesis: samples are from a normally distributed population.")
    print("\nSamples do NOT come from a normal distribution.")

print('\nLate Bedtime:')
t_stat_sw, p_value_sw = stats.shapiro(late_bedtime_scores)

if p_value_sw > alpha:
    print(f"Shapiro-Wilk test (Normality): \n{'':<5}with a p-value of {p_value_sw:.4e}, \n{'':<5}we fail to reject the null hypothesis: samples are from a normally distributed population.")
    print("\nSamples DO come from a normal distribution.")
else:
    print(f"Shapiro-Wilk test (Normality): \n{'':<5}with a p-value of {p_value_sw:.4e}, \n{'':<5}we successfully reject the null hypothesis: samples are from a normally distributed population.")
    print("\nSamples do NOT come from a normal distribution.")
Early Bedtime:
Shapiro-Wilk test (Normality):
with a p-value of 1.0500e-08,
we successfully reject the null hypothesis: samples are from a normally distributed population.
Samples do NOT come from a normal distribution.
Late Bedtime:
Shapiro-Wilk test (Normality):
with a p-value of 2.2177e-06,
we successfully reject the null hypothesis: samples are from a normally distributed population.
Samples do NOT come from a normal distribution.
For both groups we rejected the null hypothesis that the data comes from a normally distributed population.
Before moving on, we can attempt to transform the data to address this normality issue.
For a slight negative skew, a common transformation is:
$\sqrt{\max(x) + 1 - x}$
Then for a larger skew, log base 10:
$\log_{10}{(\max(x) + 1 - x)}$
Finally for even larger skews, inverse:
$\frac{1}{\max(x) + 1 - x}$
Note: for a right (positive) skew, remove the max() reflection and apply the transformation directly, following the pattern $\sqrt{x}$.
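The transformation ladder above can be sketched quickly. This is a minimal illustration on synthetic left-skewed data (not the notebook's scores), comparing how each reflection-plus-transform affects the Shapiro-Wilk p-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic left-skewed sample: a mirrored exponential.
x = 100 - rng.exponential(scale=5, size=200)

# Reflect so the long tail points right, then apply increasingly strong transforms.
reflected = max(x) + 1 - x
candidates = {
    'sqrt':    np.sqrt(reflected),
    'log10':   np.log10(reflected),
    'inverse': 1.0 / reflected,
}
for name, transformed in candidates.items():
    p = stats.shapiro(transformed).pvalue
    print(f"{name:>7}: Shapiro-Wilk p = {p:.4f}")
```

In practice the mildest transformation that passes the normality check is usually preferred, since stronger transforms make coefficients harder to interpret.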
# Transformations
# Square root first
late_bedtime_scores = np.sqrt(max(late_bedtime_scores + 1) - late_bedtime_scores)
early_bedtime_scores = np.sqrt(max(early_bedtime_scores + 1) - early_bedtime_scores)
Now to perform the previous checks to see if the transformations made a difference.
fig, ax = plt.subplots(2, 2, figsize = (6, 6))
sns.histplot(late_bedtime_scores, kde = True, ax = ax[0, 0])
sm.qqplot(late_bedtime_scores, fit = 'scale', line = '45', alpha = 0.5 , ax = ax[0, 1])
ax[0,0].set_title('Late Bedtime Score: Density')
ax[0,1].set_title('Late Bedtime Score: QQ Plot')
sns.histplot(early_bedtime_scores, kde = True, ax = ax[1, 0])
sm.qqplot(early_bedtime_scores, fit = 'scale', line = '45', alpha = 0.5 , ax = ax[1, 1])
ax[1,0].set_title('Early Bedtime Score: Density')
ax[1,1].set_title('Early Bedtime Score: QQ Plot')
fig.tight_layout()
plt.show()
# Null hypothesis H_0: samples come from a normal distribution.
print('Early Bedtime:')
t_stat_sw, p_value_sw = stats.shapiro(early_bedtime_scores)

if p_value_sw > alpha:
    print(f"Shapiro-Wilk test (Normality): \n{'':<5}with a p-value of {p_value_sw}, \n{'':<5}we fail to reject the null hypothesis: samples are from a normally distributed population.")
    print("\nSamples DO come from a normal distribution.")
else:
    print(f"Shapiro-Wilk test (Normality): \n{'':<5}with a p-value of {p_value_sw}, \n{'':<5}we successfully reject the null hypothesis: samples are from a normally distributed population.")
    print("\nSamples do NOT come from a normal distribution.")

print('\nLate Bedtime:')
t_stat_sw, p_value_sw = stats.shapiro(late_bedtime_scores)

if p_value_sw > alpha:
    print(f"Shapiro-Wilk test (Normality): \n{'':<5}with a p-value of {p_value_sw}, \n{'':<5}we fail to reject the null hypothesis: samples are from a normally distributed population.")
    print("\nSamples DO come from a normal distribution.")
else:
    print(f"Shapiro-Wilk test (Normality): \n{'':<5}with a p-value of {p_value_sw}, \n{'':<5}we successfully reject the null hypothesis: samples are from a normally distributed population.")
    print("\nSamples do NOT come from a normal distribution.")
Early Bedtime:
Shapiro-Wilk test (Normality):
with a p-value of 0.051339421421289444,
we fail to reject the null hypothesis: samples are from a normally distributed population.
Samples DO come from a normal distribution.
Late Bedtime:
Shapiro-Wilk test (Normality):
with a p-value of 0.06582079827785492,
we fail to reject the null hypothesis: samples are from a normally distributed population.
Samples DO come from a normal distribution.
Great, the transformations did the trick, but just barely. Judging by these and the previous plots, a few outliers in the tails of the data also appear to be causing trouble. We'll leave them in for now, but will address them before building our models.
Since we failed to reject normality for the transformed data, we are also justified in using an F-test to determine whether our two groups have equal variances. (If normality had not held, Levene's test could be performed instead, though the t-test itself would also be in question.)
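For completeness, here is a minimal sketch of that Levene alternative, which is robust to departures from normality. The two groups below are synthetic stand-ins, not the notebook's variables:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Two hypothetical score samples with the same spread but different means.
group_a = rng.normal(loc=77, scale=8, size=120)
group_b = rng.normal(loc=72, scale=8, size=110)

# Levene's test (H0: equal variances) does not assume normality;
# centering on the median makes it even more robust to skew.
stat, p = stats.levene(group_a, group_b, center='median')
print(f"Levene statistic: {stat:.4f}, p-value: {p:.4f}")
# A large p-value would mean no evidence of unequal variances.
```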
F-test for Equality of Variances (Two-Tailed Version)
Null Hypothesis - $H_0: \sigma_{score | early bedtime}^{2} = \sigma_{score | late bedtime}^{2}$
Alternate Hypothesis - $H_1: \sigma_{score | early bedtime}^{2} \not= \sigma_{score | late bedtime}^{2}$
# Calculate variance of each group.
s1 = np.var(late_bedtime_scores, ddof = 1) # ddof = 1 to calculate sample var.
s2 = np.var(early_bedtime_scores, ddof = 1)

# Null hypothesis H_0: Var(late) == Var(early)
# Larger var needs to be on top.
if s2 > s1:
    f_stat = s2 / s1
    df_num = len(early_bedtime_scores) - 1 # numerator
    df_denom = len(late_bedtime_scores) - 1 # denominator
else:
    f_stat = s1 / s2
    df_num = len(late_bedtime_scores) - 1 # numerator
    df_denom = len(early_bedtime_scores) - 1 # denominator

# With the larger variance in the numerator, f_stat >= 1, so the two-tailed
# p-value is twice the upper-tail probability.
p_value_f = 2 * (1 - stats.f.cdf(f_stat, df_num, df_denom))

if p_value_f > alpha:
    print(f"F-test (variance): \n{'':<5}with a p-value of {p_value_f}, \n{'':<5}we fail to reject the null hypothesis: Var(late) == Var(early)")
    print("\nPopulations HAVE equal variances.")
else:
    print(f"F-test (variance): \n{'':<5}with a p-value of {p_value_f}, \n{'':<5}we successfully reject the null hypothesis: Var(late) == Var(early)")
    print("\nPopulations DO NOT have equal variances.")
F-test (variance):
with a p-value of 0.8975117230927374,
we fail to reject the null hypothesis: Var(late) == Var(early)
Populations HAVE equal variances.
Great, this shows that we can move forward with the two-sample t-test to see whether these two groups have equal means or not.
Reminder,
Null Hypothesis - $H_0: \mu_{score | early bedtime} = \mu_{score | late bedtime}$
Alternate Hypothesis - $H_1: \mu_{score | early bedtime} \not= \mu_{score | late bedtime}$
# Finally let's perform a t-test to see if the mean scores are equal or not depending on the sleep time.
# Null hypothesis H_0: mean(late) == mean(early)
t_stat_t, p_value_t = stats.ttest_ind(late_bedtime_scores, early_bedtime_scores, alternative = 'two-sided')

if p_value_t > alpha:
    print(f"Two-Sample t-test (mean): \n{'':<5}with a p-value of {p_value_t}, \n{'':<5}we fail to reject the null hypothesis: mean(late) == mean(early)")
    print(f"\nThe means of the two samples are equal!")
else:
    print(f"Two-Sample t-test (mean): \n{'':<5}with a p-value of {p_value_t}, \n{'':<5}we successfully reject the null hypothesis: mean(late) == mean(early)")
    print(f"\nThe two sleep score population means are NOT equal!")
Two-Sample t-test (mean):
with a p-value of 0.013629025835902385,
we successfully reject the null hypothesis: mean(late) == mean(early)
The two sleep score population means are NOT equal!
This hypothesis test shows that the population mean sleep scores of the two bedtime_start_delta sets (late and early bedtimes) are different.
$\mu_{score | early bedtime} \not= \mu_{score | late bedtime}$
We can extrapolate this to say bedtime_start_delta does have some effect on sleep score and is a feature worth keeping for our models.
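One hedged caveat: a statistically significant difference is not necessarily a large one. Cohen's d is a common way to quantify the effect size. The sketch below uses illustrative numbers only; in the notebook it would be applied to the raw early/late score groups:

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Illustrative values only; in the notebook this would be
# cohens_d(early_bedtime_scores, late_bedtime_scores) on the raw scores.
d = cohens_d([77, 80, 75, 79, 78], [71, 73, 70, 74, 72])
print(f"Cohen's d: {d:.2f}")
```

A rough rule of thumb is that |d| around 0.2 is a small effect, 0.5 medium, and 0.8 or above large.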
Before removing any features, let's fit a linear regression model with score as the target variable (response). This should help us identify important features and give a good idea of the severity of multicollinearity between predictors.
linear_clf = sm.OLS(bio_sleep_df.iloc[:, :1], sm.add_constant(bio_sleep_df.iloc[:, 4:])).fit()
linear_clf.summary()
| Dep. Variable: | score | R-squared: | 0.885 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.877 |
| Method: | Least Squares | F-statistic: | 101.0 |
| Date: | Wed, 28 Feb 2024 | Prob (F-statistic): | 1.30e-137 |
| Time: | 16:52:21 | Log-Likelihood: | -869.51 |
| No. Observations: | 353 | AIC: | 1791. |
| Df Residuals: | 327 | BIC: | 1892. |
| Df Model: | 25 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 4.3622 | 28.532 | 0.153 | 0.879 | -51.768 | 60.493 |
| deep_sleep_duration | -0.0005 | 0.001 | -0.550 | 0.583 | -0.002 | 0.001 |
| light_sleep_duration | -0.0014 | 0.001 | -1.458 | 0.146 | -0.003 | 0.000 |
| rem_sleep_duration | -5.281e-05 | 0.001 | -0.056 | 0.955 | -0.002 | 0.002 |
| restless_periods | -0.0221 | 0.004 | -5.402 | 0.000 | -0.030 | -0.014 |
| awake_time | -0.0053 | 0.004 | -1.426 | 0.155 | -0.013 | 0.002 |
| time_in_bed | 0.0047 | 0.004 | 1.253 | 0.211 | -0.003 | 0.012 |
| total_sleep_duration | -0.0019 | 0.003 | -0.692 | 0.489 | -0.007 | 0.004 |
| average_breath | 0.3119 | 0.478 | 0.653 | 0.514 | -0.628 | 1.252 |
| average_heart_rate | -0.0980 | 0.071 | -1.389 | 0.166 | -0.237 | 0.041 |
| average_hrv | -0.0070 | 0.014 | -0.487 | 0.626 | -0.035 | 0.021 |
| latency | -0.0021 | 0.000 | -12.681 | 0.000 | -0.002 | -0.002 |
| lowest_heart_rate | 0.0248 | 0.060 | 0.413 | 0.680 | -0.093 | 0.143 |
| bedtime_start_delta | -0.0006 | 6.11e-05 | -9.205 | 0.000 | -0.001 | -0.000 |
| nap_today | -0.6323 | 0.557 | -1.134 | 0.257 | -1.729 | 0.464 |
| active_calories | -0.0025 | 0.010 | -0.257 | 0.797 | -0.022 | 0.017 |
| high_activity_time | 0.0005 | 0.001 | 0.329 | 0.742 | -0.002 | 0.003 |
| low_activity_time | -0.0001 | 0.000 | -0.642 | 0.521 | -0.001 | 0.000 |
| medium_activity_time | -0.0002 | 0.000 | -0.490 | 0.624 | -0.001 | 0.001 |
| meters_to_target | -4.253e-05 | 0.000 | -0.416 | 0.678 | -0.000 | 0.000 |
| non_wear_time | 7.165e-05 | 0.000 | 0.565 | 0.572 | -0.000 | 0.000 |
| resting_time | 0.0001 | 0.000 | 1.042 | 0.298 | -0.000 | 0.000 |
| sedentary_time | 6.74e-05 | 0.000 | 0.490 | 0.624 | -0.000 | 0.000 |
| steps | 2.323e-05 | 0.000 | 0.195 | 0.845 | -0.000 | 0.000 |
| total_calories | 0.0065 | 0.009 | 0.719 | 0.473 | -0.011 | 0.024 |
| temperature_deviation | 0.3037 | 0.828 | 0.367 | 0.714 | -1.325 | 1.932 |
| spo2_percentage | 0.1547 | 0.210 | 0.736 | 0.462 | -0.259 | 0.568 |
| Omnibus: | 18.471 | Durbin-Watson: | 1.919 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 23.825 |
| Skew: | -0.438 | Prob(JB): | 6.71e-06 |
| Kurtosis: | 3.924 | Cond. No. | 1.00e+16 |
Unsurprisingly, this shows a deeply flawed model when all available features are used. There is a severe collinearity and multicollinearity issue among the predictors currently fitted into this model, indicated by the condition number of 1.00e+16. Also, while the adjusted R² and F-statistic p-value indicate a well-fitted model, the features' individual t-test p-values suggest almost no features are significant; this is likely due to the collinearity and multicollinearity issues just mentioned.
We will remove redundant and insignificant features from our dataset and refit this model after to see how our changes affect the model's performance.
A quick sanity check for whether this data is compatible with linear regression methods is to analyze the residuals. Here we can plot the residuals against the model's fitted values and check for non-linear relationships or heteroskedasticity (non-constant variance) issues.
plt.scatter(linear_clf.fittedvalues, linear_clf.resid)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('MLR Residuals vs Fitted Values')
plt.hlines(y = 0, xmin = 40, xmax = 100, colors = 'black', linestyles='--')
plt.show()
This looks reasonable for checking the boxes for linear regression. While there are certainly outliers to deal with, there is no indication of non-linear relationships or funnelling of the points that would signal the issues mentioned above.
Now that we've done the work to understand the data, we can use what we've learned above to make decisions on which features to keep or remove.
# Remove sign on correlation score for comparison.
# TODO: This method arbitrarily selects the first feature (of the two with a high score)
# that it comes across. Change this method to have an insight-based selection method.
# Possibly a preference list.
corr_matrix_abs = corr_matrix.abs()
# Drop score from x and y axis.
corr_matrix_abs.drop('score', axis = 0, inplace = True)
corr_matrix_abs.drop('score', axis = 1, inplace = True)
# Mask off only one of the correlation triangles to avoid duplicates.
corr_triangle = corr_matrix_abs.where(np.triu(np.ones(corr_matrix_abs.shape), k = 1).astype(bool))
# Create list of features with correlation higher than a threshold.
# Threshold set to > 0.7
to_drop = [column for column in corr_triangle.columns if any(corr_triangle[column] > 0.7)]
print(to_drop)
['total_sleep_duration', 'average_hrv', 'lowest_heart_rate', 'medium_activity_time', 'meters_to_target', 'steps', 'total_calories']
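The TODO above asks for an insight-based tie-break. One possible approach, sketched below with the hypothetical helper drop_correlated, is to drop from each highly correlated pair the member that is less correlated with the target, rather than whichever comes first:

```python
import numpy as np
import pandas as pd

def drop_correlated(df, target, threshold=0.7):
    """For each feature pair correlated above `threshold`, drop the member
    that is less correlated with the target (instead of the first one seen)."""
    corr = df.corr().abs()
    target_corr = corr[target].drop(target)
    features = corr.drop(target, axis=0).drop(target, axis=1)
    # Keep only the upper triangle so each pair is considered once.
    upper = features.where(np.triu(np.ones(features.shape, dtype=bool), k=1))
    to_drop = set()
    for col in upper.columns:
        for row in upper.index[upper[col] > threshold]:
            # Keep whichever of the pair explains the target better.
            to_drop.add(row if target_corr[row] < target_corr[col] else col)
    return sorted(to_drop)

# Tiny illustrative frame: a and b are near-duplicates, and b tracks y better.
rng = np.random.default_rng(3)
a = rng.normal(size=300)
demo = pd.DataFrame({'a': a, 'b': a + rng.normal(scale=0.3, size=300)})
demo['y'] = 2 * demo['b'] + rng.normal(scale=0.5, size=300)
print(drop_correlated(demo, 'y'))  # drops 'a', the weaker proxy for y
```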
# For now let's manually remove features (see TODO note above).
choices_to_drop = [('average_hrv', 'average_heart_rate'),
('lowest_heart_rate', 'average_hrv'),
('medium_activity_time', 'active_calories'),
('meters_to_target', 'active_calories'),
('steps', 'a lot'),
('total_calories', 'a lot')]
bio_sleep_df.drop(['lowest_heart_rate',
'active_calories',
'meters_to_target',
'steps',
'total_calories',
'average_heart_rate',
'time_in_bed',
'total_sleep_duration'],
axis = 1, inplace = True)
Re-plot the correlation matrix to see changes.
# Correlation
# Include score column to see correlation between response and predictors too.
corr_matrix = bio_sleep_df.drop(['score_bin_even', 'score_bin_quantile', 'score_bin_custom'], axis = 1).corr()
# Plot matrix.
fig, ax = plt.subplots(figsize = (16, 16))
sns.heatmap(corr_matrix, ax = ax, annot = True, fmt = '.1f')
<Axes: >
After feature removal this correlation matrix looks much more appropriate for model fitting. Most of the highly inter-correlated features have been selectively removed, while keeping as many features as possible that correlate strongly with score. The matrix also suggests there are likely more features to remove that will not contribute much to modeling sleep score.
Subset Feature Selection / Backwards Feature Selection starting from all features.
# TODO: Implement automated feature selection.
We've seen a few outliers during the EDA that could potentially cause issues in our models. Especially in classification models where the class examples are sparse.
The class bins created earlier can be used to quickly get an idea of where these outliers are, at least in the target variable.
# Identify any score_bin_even where that class n == 1
display(bio_sleep_df['score_bin_even'].value_counts())
one_class = (bio_sleep_df['score_bin_even'].value_counts() == 1)
one_class_drop = (one_class.loc[one_class == True].index)
score_bin_even
70.5 to 80.5     165
80.5 to 90.5     101
60.5 to 70.5      68
50.5 to 60.5       9
40.5 to 50.5       5
90.5 to 100.5      4
30.5 to 40.5       1
Name: count, dtype: int64
Remove any sample which only has a class count of 1. This is also necessary if any class balancing during the train/test split is performed.
# Remove outlier for now so we can class-balance the splits (min per class required = 2)
for value in one_class_drop:
    class_outlier_loc = bio_sleep_df.loc[bio_sleep_df['score_bin_even'] == value]
    bio_sleep_df.drop(class_outlier_loc.index, axis = 0, inplace = True)

bio_sleep_df.reset_index(drop = True, inplace = True)
Since we'll be using sklearn's StandardScaler() to standardize, we need to perform the train/test split first so that the standardization statistics are computed from the training set only, preventing data leakage from the test set.
In total we will have 3 types of train/test splits to use on our models, but they will be split identically. We can also use the stratify parameter to class-balance the training and test sets, which will help generalize models trained on this fairly sparse dataset.
Note: It would also be better practice to split the data before EDA to prevent data leakage, but this was not necessary for the scope of this project.
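One hedged way to guarantee the scaler never sees held-out data, even during cross-validation, is to wrap the scaler and model together in an sklearn Pipeline. This is a sketch on synthetic data, not the project's actual split:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data; illustrative only.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 6)) * [1, 10, 100, 1000, 1, 50]  # mixed feature scales
y = X @ rng.normal(size=6) + rng.normal(scale=2, size=200)

# The pipeline re-fits the scaler inside every CV fold, so the fold's
# held-out data never influences the scaling parameters.
pipe = make_pipeline(StandardScaler(), Ridge())
scores = cross_val_score(pipe, X, y, cv=5, scoring='r2')
print(f"Mean CV R^2: {scores.mean():.3f}")
```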
rand_state = 87654321 #123456789 #1111 # Used in all models that accept a random state.
test_ratio = 0.2
bin_type = 'score_bin_custom'
# Discretized Scores
# Stratify-balance the classes with the bin column (note this will slightly alter regression scores too).
X_train, X_test, y_train, y_test = train_test_split(bio_sleep_df.iloc[:, 4:], bio_sleep_df.iloc[:, :4],
test_size = test_ratio,
random_state = rand_state)#,
#stratify = bio_sleep_df['score_bin_custom'])
# Binned/Categorical Class Labels
y_train_bin, y_test_bin = y_train[bin_type], y_test[bin_type]
Since we'll be using many machine learning models in which regularization techniques can be sensitive to differences in predictor variance magnitudes/scale (e.g., Lasso/Ridge Regression, SVM, etc.) and expect features to be roughly normally distributed, we will be standardizing features appropriately.
Standardization will be done with the following formula:
$$Z = \frac{(X - \bar{X})}{S}$$
This centers inputs around 0 with the sample mean, and scales them with the sample standard deviation.
# Centered and Standardized
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns = scaler.get_feature_names_out())
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns = scaler.get_feature_names_out())
# Squeeze y's to put into a series, same format as train_test_split
y_train_scaled = pd.DataFrame(scaler.fit_transform(np.array(y_train['score']).reshape(-1, 1)), columns = ['score_scaled']).squeeze(axis = 1)
y_test_scaled = pd.DataFrame(scaler.transform(np.array(y_test['score']).reshape(-1, 1)), columns = ['score_scaled']).squeeze(axis = 1)
Something we haven't discussed yet, outside the realm of feature engineering, is whether regression or logistic regression makes more sense for this type of problem.
First, note that this isn't a binary classification problem: since we've binned our target variable into multiple classes, how we handle this is important, and the sklearn package handles it a little differently per model. The methods we'll use for this multi-class (sometimes called multinomial) classification problem are One-vs-Rest, One-vs-One, and multinomial loss (a.k.a. cross-entropy). These methods handle multiple classes fine, but we still need to ask what we're looking for out of this.
Given the predictor variables and expected prediction, are we expecting a:
- continuous output (a predicted sleep score), which calls for regression, or
- categorical output (a predicted sleep score bin), which calls for classification?
It's very important to note that comparing regression and classification models is almost always a fool's errand: it is very hard to build a convincing metric that relates the two types of models. In practice, it's better to properly formulate the question beforehand and fit the models that best answer it, instead of fitting them all and attempting to compare things that can't be directly compared.
All that said, because this project is for exploration, let's go ahead and do just that by creating a few of each of regression and logistic regression models and discuss this further in the conclusion section.
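The three multi-class strategies mentioned above can be sketched side by side. This uses synthetic data standing in for the binned scores, and sklearn's defaults vary slightly across versions, so treat it as illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

# Small synthetic multi-class problem standing in for the binned sleep scores.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=4, random_state=0)

strategies = {
    'one-vs-rest': OneVsRestClassifier(LogisticRegression(max_iter=5000)),
    'one-vs-one':  OneVsOneClassifier(LogisticRegression(max_iter=5000)),
    'multinomial': LogisticRegression(max_iter=5000),  # cross-entropy over all classes
}
for name, clf in strategies.items():
    clf.fit(X, y)
    print(f"{name:>12}: training accuracy {clf.score(X, y):.3f}")
```

One-vs-Rest fits one binary classifier per class, One-vs-One fits one per class pair, and the multinomial formulation optimizes a single cross-entropy loss over all classes at once.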
An important thing to set before initializing models is an expected baseline to which we can compare performance. At the very least, our models have to improve on these metrics.
A useful baseline for classification is the Zero Rate: a "model" that always predicts the most populous class.
# Calculate the random chance baseline for our classification models.
# 1 / number of classes
random_base_class = 1 / len(bin_map_custom.unique())
print(f"The random chance baseline for classification models is: {random_base_class}")
# Calculate the Zero Rate baseline for our classification models.
zero_rate_base_class = y_train_bin.value_counts().max() / len(y_train_bin)
print(f"The Zero Rate accuracy baseline for classification models that always chooses the most populous class is: {zero_rate_base_class}")
The random chance baseline for classification models is: 0.16666666666666666
The Zero Rate accuracy baseline for classification models that always chooses the most populous class is: 0.2491103202846975
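As a cross-check, sklearn's DummyClassifier reproduces the Zero Rate baseline directly. The labels below are illustrative, not y_train_bin:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Illustrative class labels; in the notebook these would be y_train_bin.
y = np.array(['70-80'] * 70 + ['80-90'] * 40 + ['60-70'] * 30)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

# 'most_frequent' always predicts the most populous class (the Zero Rate).
zero_rate = DummyClassifier(strategy='most_frequent').fit(X, y)
print(f"Zero Rate accuracy: {zero_rate.score(X, y):.3f}")  # 70/140 = 0.500
```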
A good baseline for regression models is to identify what metrics will be used to evaluate them, then use the mean or median of the target to produce a baseline for that metric.
For example, measure the Mean Absolute Error (MAE) if a "model" always predicted values of the median of the target variable (sleep score).
# Calculate the median-prediction MAE baseline for our regression models.
y_pred_baseline = np.full_like(y_train['score'], fill_value = y_train['score'].median(), dtype = float)
mae_baseline_regress = mean_absolute_error(y_train['score'], y_pred_baseline)
print(f"The baseline Mean Absolute Error (MAE) for regression models is: {mae_baseline_regress}")
The baseline Mean Absolute Error (MAE) for regression models is: 6.249110320284697
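Similarly, sklearn's DummyRegressor can reproduce this median-prediction MAE baseline. The target values below are illustrative, not y_train['score']:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Illustrative target values; in the notebook this would be y_train['score'].
y = np.array([72.0, 75.0, 78.0, 81.0, 69.0, 77.0, 80.0])
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

# 'median' always predicts the training median, matching the manual baseline.
baseline = DummyRegressor(strategy='median').fit(X, y)
mae = mean_absolute_error(y, baseline.predict(X))
print(f"Median-prediction MAE baseline: {mae:.3f}")
```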
class MidpointNormalize(Normalize):
    def __init__(self, vmin=None, vmax=None, midpoint=None, clip=False):
        self.midpoint = midpoint
        Normalize.__init__(self, vmin, vmax, clip)

    def __call__(self, value, clip=None):
        x, y = [self.vmin, self.midpoint, self.vmax], [0, 0.5, 1]
        return np.ma.masked_array(np.interp(value, x, y))

def plotSearchGrid(grid):
    scores = [x for x in grid.cv_results_["mean_test_score"]]
    scores = np.array(scores).reshape(len(grid.param_grid['max_depth']), len(grid.param_grid['max_leaf_nodes']))

    plt.figure(figsize=(10, 8))
    plt.subplots_adjust(left=.2, right=0.95, bottom=0.15, top=0.95)
    plt.imshow(scores, interpolation='nearest', cmap=plt.cm.plasma,
               norm=MidpointNormalize(vmin=0.2, midpoint=0.92))
    plt.xlabel('max_leaf_nodes')
    plt.ylabel('max_depth')
    plt.colorbar()
    plt.xticks(np.arange(len(grid.param_grid['max_leaf_nodes'])), grid.param_grid['max_leaf_nodes'], rotation=90)
    plt.yticks(np.arange(len(grid.param_grid['max_depth'])), grid.param_grid['max_depth'])
    plt.title('Validation Score')
    plt.show()

def plotSearchGridSVM(grid):
    scores = [x for x in grid.cv_results_["mean_test_score"]]
    scores = np.array(scores).reshape(len(grid.param_grid['C']), len(grid.param_grid['gamma']))

    plt.figure(figsize=(10, 8))
    plt.subplots_adjust(left=.2, right=0.95, bottom=0.15, top=0.95)
    plt.imshow(scores, interpolation='nearest', cmap=plt.cm.plasma,
               norm=MidpointNormalize(vmin=0.2, midpoint=0.92))
    plt.xlabel('Gamma')
    plt.ylabel('C')
    plt.colorbar()
    plt.xticks(np.arange(len(grid.param_grid['gamma'])), grid.param_grid['gamma'], rotation=90)
    plt.yticks(np.arange(len(grid.param_grid['C'])), grid.param_grid['C'])
    plt.title('Validation Score')
    plt.show()
# Initialize list of fitted model objects.
models_regression = []
models_classification = []
After standardization, feature removal, and outlier removal, let's see how the metrics associated with the linear regression model have changed.
Note: in comparison with the previous regression model fit during the EDA section, which was only used for feature exploration, this model has been fit on only the training split. Still, we can draw some useful conclusions from it.
mod_linear_OLS = sm.OLS(y_train['score'].reset_index(drop = True), sm.add_constant(X_train_scaled)).fit()
models_regression.append(mod_linear_OLS)
mod_linear_OLS.summary()
| Dep. Variable: | score | R-squared: | 0.882 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.874 |
| Method: | Least Squares | F-statistic: | 108.7 |
| Date: | Wed, 28 Feb 2024 | Prob (F-statistic): | 3.15e-110 |
| Time: | 16:52:22 | Log-Likelihood: | -696.78 |
| No. Observations: | 281 | AIC: | 1432. |
| Df Residuals: | 262 | BIC: | 1501. |
| Df Model: | 18 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 75.8826 | 0.178 | 425.234 | 0.000 | 75.531 | 76.234 |
| deep_sleep_duration | 3.6552 | 0.232 | 15.764 | 0.000 | 3.199 | 4.112 |
| light_sleep_duration | 3.4404 | 0.252 | 13.631 | 0.000 | 2.943 | 3.937 |
| rem_sleep_duration | 4.0485 | 0.192 | 21.048 | 0.000 | 3.670 | 4.427 |
| restless_periods | -1.2579 | 0.247 | -5.083 | 0.000 | -1.745 | -0.771 |
| awake_time | -1.9036 | 0.210 | -9.079 | 0.000 | -2.316 | -1.491 |
| average_breath | 0.0709 | 0.213 | 0.334 | 0.739 | -0.348 | 0.489 |
| average_hrv | 0.3172 | 0.211 | 1.500 | 0.135 | -0.099 | 0.734 |
| latency | -2.5018 | 0.214 | -11.704 | 0.000 | -2.923 | -2.081 |
| bedtime_start_delta | -1.6799 | 0.224 | -7.493 | 0.000 | -2.121 | -1.238 |
| nap_today | -0.0134 | 0.194 | -0.069 | 0.945 | -0.395 | 0.368 |
| high_activity_time | 0.2556 | 0.196 | 1.307 | 0.192 | -0.130 | 0.641 |
| low_activity_time | 0.0520 | 0.583 | 0.089 | 0.929 | -1.097 | 1.201 |
| medium_activity_time | 0.5161 | 0.321 | 1.610 | 0.109 | -0.115 | 1.147 |
| non_wear_time | 0.3917 | 0.675 | 0.580 | 0.562 | -0.938 | 1.721 |
| resting_time | 0.6315 | 0.526 | 1.200 | 0.231 | -0.405 | 1.668 |
| sedentary_time | 0.6850 | 0.809 | 0.847 | 0.398 | -0.908 | 2.278 |
| temperature_deviation | 0.0768 | 0.197 | 0.390 | 0.697 | -0.311 | 0.465 |
| spo2_percentage | 0.1773 | 0.202 | 0.878 | 0.381 | -0.221 | 0.575 |
| Omnibus: | 17.170 | Durbin-Watson: | 2.142 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 19.924 |
| Skew: | -0.525 | Prob(JB): | 4.72e-05 |
| Kurtosis: | 3.775 | Cond. No. | 11.4 |
The expectation here was not to finely tune this model, but to use the model summary to understand how the data preparation has affected the models.
As said, further tuning could be done here to improve this model, but it was still beneficial as a sanity check of the data preparation process.
mod_log_regression = LogisticRegression(solver = 'lbfgs', multi_class = 'multinomial', max_iter = 30000, random_state = rand_state).fit(X_train, y_train_bin)
models_classification.append(mod_log_regression)
mod_lin_regression = LinearRegression().fit(X_train, y_train['score'])
models_regression.append(mod_lin_regression)
mod_ridge_regression = Ridge(max_iter = 10000, random_state = rand_state).fit(X_train, y_train['score'])
models_regression.append(mod_ridge_regression)
mod_ridge_classifier = RidgeClassifier(max_iter = 10000, random_state = rand_state).fit(X_train, y_train_bin)
models_classification.append(mod_ridge_classifier)
mod_lasso_regression = Lasso(max_iter = 10000, random_state = rand_state).fit(X_train, y_train['score'])
models_regression.append(mod_lasso_regression)
/opt/homebrew/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of f AND g EVALUATIONS EXCEEDS LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
For hyperparameter tuning we will use a grid search with a range of C and Gamma values, and apply 3-fold cross-validation.
params = {'C': np.logspace(6, 10, base = 2), 'gamma': np.logspace(-10, 0, base = 2)}
grid_svm_regression = GridSearchCV(SVR(kernel = 'rbf'), param_grid = params, cv = 3).fit(X_train_scaled, y_train['score'])
mod_svm_regression = grid_svm_regression.best_estimator_
models_regression.append(mod_svm_regression)
grid_svm_classifier = GridSearchCV(SVC(kernel = 'rbf', random_state = rand_state, probability = True), param_grid = params, cv = 3).fit(X_train_scaled, y_train_bin)
mod_svm_classifier = grid_svm_classifier.best_estimator_
models_classification.append(mod_svm_classifier)
plotSearchGridSVM(grid_svm_regression)
scores = [x for x in grid_svm_regression.cv_results_["mean_test_score"]]
max_index = scores.index(max(scores))
max_score = max(scores)
opt_C = grid_svm_regression.cv_results_['param_C'][max_index]
opt_Gamma = grid_svm_regression.cv_results_['param_gamma'][max_index]
print(f'GridSearch results indicate the optimized parameters are, \nC: {opt_C}, Gamma: {opt_Gamma}, Validation R2 Score: {max_score}')
# OR use the attribute .best_params_
grid_svm_regression.best_params_
GridSearch results indicate the optimized parameters are,
C: 1024.0, Gamma: 0.001124953927697749, Validation R2 Score: 0.8797566276176538
{'C': 1024.0, 'gamma': 0.001124953927697749}
A decision tree with default parameters makes a good baseline.
dt_classifier = DecisionTreeClassifier(max_depth = None, max_leaf_nodes = None, random_state = rand_state).fit(X_train, y_train_bin)
models_classification.append(dt_classifier)
print(dt_classifier.score(X_train, y_train_bin))
print(dt_classifier.score(X_test, y_test_bin))
1.0
0.30985915492957744
adaboost_classifier = AdaBoostClassifier(DecisionTreeClassifier(max_depth = 1),
n_estimators = 100,
random_state = rand_state,
learning_rate = 0.5
).fit(X_train, y_train_bin)
models_classification.append(adaboost_classifier)
print(adaboost_classifier.score(X_test, y_test_bin))
results_adaboost = list(zip(adaboost_classifier.predict(X_test), y_test['score']))
0.23943661971830985
params = {'max_depth': np.linspace(1, 25, 25, dtype = int), 'max_leaf_nodes': np.linspace(2, 98, 25, dtype = int)}
grid_random_forest_regression = GridSearchCV(RandomForestRegressor(n_estimators = 250, random_state=rand_state),
param_grid = params,
cv = 3,
#verbose = 3,
n_jobs = -1
).fit(X_train, y_train['score'])
mod_random_forest_regression = grid_random_forest_regression.best_estimator_
models_regression.append(mod_random_forest_regression)
params = {'max_depth': np.linspace(1, 25, 25, dtype = int), 'max_leaf_nodes': np.linspace(2, 98, 25, dtype = int)}
grid_random_forest_classifier = GridSearchCV(RandomForestClassifier(n_estimators = 250, random_state=rand_state),
param_grid = params,
cv = 3,
#verbose = 3,
n_jobs = -1
).fit(X_train, y_train_bin)
mod_random_forest_classifier = grid_random_forest_classifier.best_estimator_
models_classification.append(mod_random_forest_classifier)
plotSearchGrid(grid_random_forest_regression)
feat_importance = grid_random_forest_regression.best_estimator_.feature_importances_
feat_imp_std = np.std([tree.feature_importances_ for tree in grid_random_forest_regression.best_estimator_.estimators_], axis = 0)
forest_importances = pd.Series(feat_importance, index=bio_sleep_df.columns[4:])
fig, ax = plt.subplots()
forest_importances.plot.bar(yerr = feat_imp_std, ax = ax)
ax.set_title("Feature Importance Regressor")
ax.set_ylabel("Mean Decrease in Impurity")
fig.tight_layout()
plotSearchGrid(grid_random_forest_classifier)
feat_importance = grid_random_forest_classifier.best_estimator_.feature_importances_
feat_imp_std = np.std([tree.feature_importances_ for tree in grid_random_forest_classifier.best_estimator_.estimators_], axis = 0)
forest_importances = pd.Series(feat_importance, index=bio_sleep_df.columns[4:])
fig, ax = plt.subplots()
forest_importances.plot.bar(yerr = feat_imp_std, ax = ax)
ax.set_title("Feature Importance - Classifier")
ax.set_ylabel("Mean Decrease in Impurity")
fig.tight_layout()
def scores_classification(models):
"""Calculates and returns accuracy score from fitted classifier models.
Parameters:
models (list): Iterable of fitted classifier model objects.
Returns:
Nested dictionary where model name is the key and its scores are nested within.
{'model name' : {'Accuracy': score, 'F1_Score(class_weighted)': score}}"""
# Create dictionary to return dataframe of scores.
model_scores = {}
print(f'Using target variable bin interval: {bin_type}')
for i, mod in enumerate(models):
# Check if model is the SVC(), uses the scaled features.
if 'SVC' in str(mod):
X_test_temp = X_test_scaled
else:
X_test_temp = X_test
model_name = str(mod).partition('(')[0]
accuracy_score = mod.score(X_test_temp, y_test_bin)
f1score = f1_score(y_test_bin, mod.predict(X_test_temp), average = 'weighted')
# Set scores to model in dictionary.
model_scores[model_name] = {'Accuracy': accuracy_score,
'F1_Score(class_weighted)': f1score}
return model_scores
def ols_scores(model):
"""Calculates and returns regression metrics for statsmodels OLS() fitted models.
Parameters:
model (object): Fitted regressor model object.
Returns:
Nested dictionary where model name is the key and the model evaluation metrics are nested within.
{'model name' : {'R2': score,
'Adj_R2': score,
'RMSE': score,
'MAE': score}}"""
y_pred = model.predict(sm.add_constant(X_test_scaled))
r2 = model.rsquared
adj_r2 = model.rsquared_adj
rmse = mean_squared_error(y_test['score'], y_pred, squared = False)
mae = mean_absolute_error(y_test['score'], y_pred)
model_score = {'R2': r2,
'Adj_R2': adj_r2,
'RMSE': rmse,
'MAE': mae}
return model_score
def scores_regression(models):
"""Calculates and returns R2, Adjusted R2, Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) from fitted regressor models.
Parameters:
models (list): Iterable of fitted regressor model objects.
Returns:
Nested dictionary where model name is the key and the model evaluation metrics are nested within.
{'model name' : {'R2': score,
'Adj_R2': score,
'RMSE': score,
'MAE': score}}"""
# Create dictionary to return dataframe of scores.
model_scores = {}
for mod in models:
model_name = str(mod).partition('(')[0]
# Check if model is the SVR(), uses the scaled features.
if 'SVR' in model_name:
X_test_temp = X_test_scaled
elif 'statsmodels' in model_name:
model_scores['LinearRegression(scaled)'] = ols_scores(mod)
continue
else:
X_test_temp = X_test
y_pred = mod.predict(X_test_temp)
r2 = mod.score(X_test_temp, y_test['score'])
n , p = X_test_temp.shape # number of samples and predictors
adj_r2 = 1 - (((1 - r2) * (n - 1)) / (n - p - 1))
rmse = mean_squared_error(y_test['score'], y_pred, squared = False)
mae = mean_absolute_error(y_test['score'], y_pred)
# Set scores to model in dictionary.
model_scores[model_name] = {'R2': r2,
'Adj_R2': adj_r2,
'RMSE': rmse,
'MAE': mae}
return model_scores
def plot_predictions_vs_score_regression(models):
"""Prints to output (n x 3) subplot array of each model's predicted sleep scores and observed (truth) sleep scores for each X_test, y_test pair, sorted by descending sleep score.
(where n = number of model objects). Can handle any number of input models.
Parameters:
models (list): Iterable of fitted regressor model objects.
Returns:
None. Prints plot to output."""
fig, ax = plt.subplots(int(np.ceil(len(models) / 3)), 3, figsize = (11, 7))
for i, mod in enumerate(models):
# Check if model is the SVR() or OLS(), uses the scaled features.
if 'SVR' in str(mod):
X_test_temp = X_test_scaled
else:
X_test_temp = X_test
# Check if model is a statsmodel object, if so use their score conventions.
if 'statsmodels' in str(mod):
temp_scores = ols_scores(mod)
y_pred = mod.predict(sm.add_constant(X_test_scaled))
r2 = temp_scores['R2']
adj_r2 = temp_scores['Adj_R2']
rmse = temp_scores['RMSE']
mae = temp_scores['MAE']
model_name = 'LinearRegression(scaled)'
else:
y_pred = mod.predict(X_test_temp)
r2 = mod.score(X_test_temp, y_test['score'])
n , p = X_test_temp.shape # number of samples and predictors
adj_r2 = 1 - (((1 - r2) * (n - 1)) / (n - p - 1))
rmse = mean_squared_error(y_test['score'], y_pred, squared = False)
mae = mean_absolute_error(y_test['score'], y_pred)
model_name = str(mod).partition('(')[0]
results = list(zip(y_pred, y_test['score']))
results = sorted(results, key = lambda x: x[1], reverse = True)
ax[i // 3, i % 3].plot(results, alpha = 0.85, drawstyle = 'steps')
ax[i // 3, i % 3].set_xlabel('Index (sorted)')
ax[i // 3, i % 3].set_ylabel('Sleep Score')
ax[i // 3, i % 3].set_title(model_name)
ax[i // 3, i % 3].grid(alpha = 0.4)
ax[i // 3, i % 3].text(x = 0.1,
y = 0.2,
transform = ax[i // 3, i % 3].transAxes,
s = (f"{mae:0.3f} - MAE"
'\n'f"{rmse:0.3f} - RMSE"
'\n'f"{r2:0.3f} - R2"
'\n'f"{adj_r2:0.3f} - Adj_R2"))
# Delete any unused trailing subplots (model count not a multiple of 3).
for j in range(len(models), int(np.ceil(len(models) / 3)) * 3):
fig.delaxes(ax[j // 3, j % 3])
plt.subplots_adjust(wspace = 0.2, hspace = 0.4)
fig.legend(['Predicted', 'Truth'], loc = 'upper right')
fig.suptitle('Predicted Score and True Score', fontsize = 12)
return None
def plot_obs_vs_pred_regression(models):
"""Prints to output (n x 3) subplot array of each model's observed vs predicted plots.
(where n = number of model objects). Can handle any number of input models.
Parameters:
models (list): Iterable of fitted regressor model objects.
Returns:
None. Prints plot to output."""
fig, ax = plt.subplots(int(np.ceil(len(models) / 3)), 3, figsize = (11, 7))
for i, mod in enumerate(models):
# Check if model is the SVR(), uses the scaled features.
if 'SVR' in str(mod):
X_test_temp = X_test_scaled
else:
X_test_temp = X_test
# Check if model is a statsmodel object, if so use their score conventions.
if 'statsmodels' in str(mod):
y_pred = mod.predict(sm.add_constant(X_test_scaled))
model_name = 'LinearRegression(scaled)'
else:
y_pred = mod.predict(X_test_temp)
model_name = str(mod).partition('(')[0]
ax[i // 3, i % 3].scatter(y_pred, y_test['score'], alpha = 0.5)
ax[i // 3, i % 3].plot(np.arange(min(y_test['score']), 100, 1), np.arange(min(y_test['score']), 100, 1), 'k--', alpha = 0.6)
ax[i // 3, i % 3].set_xlabel('Predictions')
ax[i // 3, i % 3].set_ylabel('Observations')
ax[i // 3, i % 3].set_title(model_name)
ax[i // 3, i % 3].grid(alpha = 0.4)
# Delete any unused trailing subplots (model count not a multiple of 3).
for j in range(len(models), int(np.ceil(len(models) / 3)) * 3):
fig.delaxes(ax[j // 3, j % 3])
plt.subplots_adjust(wspace = 0.3, hspace = 0.4)
fig.suptitle('Observed (Truth) vs Predicted', fontsize = 12)
return None
def plot_models_roc(models):
"""Prints to output (n x 3) subplot array of each model's ROC curve plot.
(where n = number of model objects). Can handle any number of input models.
Parameters:
models (list): Iterable of fitted classifier model objects.
Returns:
None. Prints plot to output."""
# Need to use LabelBinarizer() since we have multiple classes.
label_binarizer = LabelBinarizer().fit(y_train_bin)
y_onehot_test = label_binarizer.transform(y_test_bin)
for mod in models:
if 'Ridge' in str(mod): # Can't calculate probabilities in Ridge.
continue
if 'SVC' in str(mod): # Use scaled X_test values
y_scores = mod.predict_proba(X_test_scaled)
else:
y_scores = mod.predict_proba(X_test)
fpr, tpr, _ = roc_curve(y_onehot_test.ravel(), y_scores.ravel())
auc_score = auc(fpr, tpr)
plt.plot(fpr, tpr, label = f"{str(mod).partition('(')[0]} - AUC = {auc_score:0.2f}")
plt.plot(np.arange(0, 1.1, 0.1), np.arange(0, 1.1, 0.1), 'k--')
plt.title('ROC - Micro-Average OvR')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.grid(alpha = 0.4)
plt.legend(loc = 'lower right')
return None
# Function to iterate through classification models and plot confusion matrix together.
def plot_models_confusion_matrix(models):
"""Prints to output (n x 3) subplot array of each model's confusion matrix.
(where n = number of model objects). Can handle any number of input models.
Parameters:
models (list): Iterable of fitted classifier model objects.
Returns:
None. Prints plot to output."""
fig, ax = plt.subplots(int(np.ceil(len(models) / 3)), 3, figsize = (10, 8), sharey = True)
for i, mod in enumerate(models):
# Check if model is the SVC(), uses the scaled features.
if 'SVC' in str(mod):
X_test_temp = X_test_scaled
else:
X_test_temp = X_test
cm = confusion_matrix(y_test_bin, mod.predict(X_test_temp), labels = y_test_bin.sort_values().unique())
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = y_test_bin.sort_values().unique())
disp.plot(ax = ax[i // 3, i % 3], xticks_rotation = 90)
disp.ax_.set_title(str(mod).partition('(')[0])
disp.im_.colorbar.remove()
# Delete any unused trailing subplots (model count not a multiple of 3).
for j in range(len(models), int(np.ceil(len(models) / 3)) * 3):
fig.delaxes(ax[j // 3, j % 3])
fig.tight_layout()
plt.subplots_adjust(wspace = 0.2, hspace = 0)
fig.colorbar(disp.im_, ax = ax)
fig.suptitle('Confusion Matrices of Classifier Models')
return None
results_regression = scores_regression(models_regression)
display(pd.DataFrame(results_regression).T.sort_values(by = 'MAE'))
| | R2 | Adj_R2 | RMSE | MAE |
|---|---|---|---|---|
| SVR | 0.866523 | 0.820319 | 2.526703 | 2.041957 |
| Lasso | 0.838919 | 0.783160 | 2.775705 | 2.212797 |
| Ridge | 0.837268 | 0.780937 | 2.789897 | 2.220879 |
| LinearRegression | 0.837241 | 0.780901 | 2.790127 | 2.221564 |
| LinearRegression(scaled) | 0.881864 | 0.873748 | 2.790127 | 2.221564 |
| RandomForestRegressor | 0.665764 | 0.550067 | 3.998320 | 3.021399 |
Models have been sorted by Mean Absolute Error (MAE).
These results look great: recall that our baseline for regression models was the MAE of always predicting the median sleep score of the training data (see below).
print(f"The baseline Mean Absolute Error (MAE) for regression models is: {mae_baseline_regress}")
The baseline Mean Absolute Error (MAE) for regression models is: 6.249110320284697
The SVM has reduced that error by more than 4 points, and every model improves on this baseline.
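The median baseline can be sketched in a few lines; the scores below are made up for illustration, not taken from the export:

```python
import numpy as np

# Hypothetical train/test sleep scores.
train_scores = np.array([70, 75, 80, 82, 85, 88, 90])
test_scores = np.array([68, 74, 81, 86, 92])

# Baseline: always predict the training median, then measure MAE on the test set.
median_pred = np.median(train_scores)
mae_baseline = np.mean(np.abs(test_scores - median_pred))
print(median_pred, mae_baseline)  # 82.0 7.4
```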
plot_predictions_vs_score_regression(models_regression)
This plot and the next give a good view of how each method handles different ranges of sleep score.
plot_obs_vs_pred_regression(models_regression)
This plot illustrates the same comparison and helps identify where each model performs best across the distribution. It is also apparent which model performed best: the one whose plot concentrates the most points around the 45-degree line.
results_classification = scores_classification(models_classification)
display(pd.DataFrame(results_classification).T.sort_values(by = 'Accuracy', ascending = False))
Using target variable bin interval: score_bin_custom
| | Accuracy | F1_Score(class_weighted) |
|---|---|---|
| SVC | 0.633803 | 0.626915 |
| LogisticRegression | 0.605634 | 0.607407 |
| RandomForestClassifier | 0.563380 | 0.538434 |
| RidgeClassifier | 0.436620 | 0.376509 |
| DecisionTreeClassifier | 0.309859 | 0.304995 |
| AdaBoostClassifier | 0.239437 | 0.174365 |
The baseline accuracy for classification models uses the Zero Rate: a model that only ever predicts the most frequent class in the training data (see below).
print(f"The Zero Rate baseline for classification models that always chooses the most populous class is: {zero_rate_base_class}")
The Zero Rate baseline for classification models that always chooses the most populous class is: 0.2491103202846975
So the SVM's accuracy of ~0.63 beats the baseline rate of ~0.25 by a wide margin.
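The Zero Rate baseline is just the frequency with which the most common training class appears in the test set; a minimal sketch with made-up labels:

```python
from collections import Counter

# Hypothetical class labels for illustration only.
train_labels = ['good', 'good', 'fair', 'poor', 'good', 'fair']
test_labels = ['good', 'fair', 'good', 'poor']

# Zero Rate: always predict the most frequent training class.
majority = Counter(train_labels).most_common(1)[0][0]
zero_rate = sum(lab == majority for lab in test_labels) / len(test_labels)
print(majority, zero_rate)  # good 0.5
```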
plot_models_confusion_matrix(models_classification)
plot_models_roc(models_classification)
Because we have imbalanced classes, the best metric for comparing these classification models is the ROC curve / AUC. In this case the AUC rankings happen to align with the accuracy scores, but accuracy alone doesn't tell the full story of how the models trade off False Positive Rate (FPR) against True Positive Rate (TPR).
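Micro-averaging (as used in plot_models_roc above) ravels the one-hot truth matrix and the probability matrix into one long binary problem before computing a single curve; a self-contained sketch on a toy 3-class example:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import LabelBinarizer

# Toy 3-class labels and predicted probabilities (each row sums to 1).
y_true = np.array([0, 1, 2, 1, 0, 2])
y_proba = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.2, 0.2, 0.6],
                    [0.3, 0.5, 0.2],
                    [0.5, 0.3, 0.2],
                    [0.1, 0.3, 0.6]])

# One-hot encode the labels, then ravel both matrices into 1-D arrays.
y_onehot = LabelBinarizer().fit_transform(y_true)
fpr, tpr, _ = roc_curve(y_onehot.ravel(), y_proba.ravel())
micro_auc = auc(fpr, tpr)
print(micro_auc)  # 1.0 (every true class gets the highest probability here)
```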
print('Best Regression Model (by Mean Absolute Error (MAE)):')
display(pd.DataFrame(results_regression).T.sort_values(by = 'MAE').iloc[0:1, :])
print('-----------------------------------------------------')
print('Best Classification Model (by Accuracy Score):')
display(pd.DataFrame(results_classification).T.sort_values(by = 'Accuracy', ascending = False).iloc[0:1, :])
Best Regression Model (by Mean Absolute Error (MAE)):
| | R2 | Adj_R2 | RMSE | MAE |
|---|---|---|---|---|
| SVR | 0.866523 | 0.820319 | 2.526703 | 2.041957 |
-----------------------------------------------------
Best Classification Model (by Accuracy Score):
| | Accuracy | F1_Score(class_weighted) |
|---|---|---|
| SVC | 0.633803 | 0.626915 |
Analyzing the Results and Regression vs Classification:
At a glance, and across several iterations of training/validation, the Support Vector Machines (SVMs) were consistently at the top in both regression and classification. Note that not all of these models were fitted using cross-validation, which could potentially improve those that weren't.
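For the models that were fitted without a search, cross-validated scoring is cheap to add; a minimal sketch with cross_val_score on synthetic data, mirroring the cv=3 used in the grid searches:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data roughly the size of the real training set.
X, y = make_regression(n_samples=280, n_features=18, noise=5.0, random_state=0)

# One R2 score per fold instead of a single train/test split.
cv_r2 = cross_val_score(Ridge(alpha=1.0), X, y, cv=3, scoring='r2')
print(cv_r2.mean(), cv_r2.std())
```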
The short answer:
This dataset is currently small, and many of the models would likely benefit from a larger and more balanced dataset. The data was split into an 80/20 train/test split, totaling 281 and 71 rows respectively. This limited the ability to perform cross-validation, and compromises were made to keep the test set an adequate size. As the dataset grows, all the models will likely see reduced errors and better fits.
There is no true way to directly compare a regression model and a classification model. If your data is originally formatted for regression and the problem requires a quantitative answer, you should use regression. The exception is when you want a categorical prediction, in which case you can still perform regression and then map the prediction onto a categorical label. Regression makes the most sense for predicting sleep scores with this dataset. It also gives us the most interpretable model, since we know our margin of error, compared to categorical bins with fairly arbitrary edge decisions.
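The regress-then-bin approach can be sketched with np.digitize; the edges and labels below are hypothetical, not the notebook's score_bin_custom intervals:

```python
import numpy as np

# Continuous predictions from some regressor (made-up values).
y_pred = np.array([58.2, 71.9, 83.4, 90.1])

# Hypothetical ordinal bin edges and labels.
edges = [60, 70, 80, 90]
labels = np.array(['pay attention', 'fair', 'good', 'great', 'optimal'])

# Map each continuous prediction onto its ordinal category.
binned = labels[np.digitize(y_pred, edges)]
print(binned)  # ['pay attention' 'good' 'great' 'optimal']
```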
The SVM regression model scored the best across all metrics except Adjusted R2, where it was beaten by the linear regression model with scaled features. The Mean Absolute Error (MAE) was improved from the baseline of ~6.2 to ~2.0, a large reduction in error.
MAE or Root Mean Squared Error (RMSE) makes the most sense for comparing the regression models here, because both are expressed in the same units as the target variable score, giving a highly interpretable margin of error on predictions. MAE is the choice here because the dataset, given its small size, still contains outliers that inflate RMSE.
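The outlier sensitivity behind that choice is easy to demonstrate: two sets of absolute errors with identical MAE can have very different RMSE:

```python
import numpy as np

# Two error sets with the same mean absolute error.
uniform = np.array([2.0, 2.0, 2.0, 2.0])
with_outlier = np.array([0.5, 0.5, 0.5, 6.5])

# RMSE squares each error before averaging, so large errors dominate.
rmse = lambda e: float(np.sqrt(np.mean(e ** 2)))

print(uniform.mean(), rmse(uniform))            # 2.0 2.0
print(with_outlier.mean(), rmse(with_outlier))  # 2.0 ~3.28 (outlier inflates RMSE)
```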
Interestingly, the linear regression models performed well too; further tuning of these models would be beneficial.
The long answer / Tangent:
As discussed before, there is a lot to be said about taking what is a regression task and casting it as a classification task. It could be argued that the target variable score is fuzzy as a ground truth: we do not know the methods by which Oura computed it, nor whether it makes sense to measure sleep on a discrete 0-100 scale. Even in a medical observational sleep study, measuring sleep quality relies on metrics from several biological systems that are correlated with sleep, achieving only a best approximation itself.
However, having worn the Oura Ring for over a year now, I see strong agreement between its sleep scores and how I perceived the previous night's sleep. The question is how much of a difference there is between, say, a score of 81 and 85, and what would be lost by changing from discrete values to categorical binning of the target variable.
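pd.cut makes the question concrete: with hypothetical bin edges, 81 and 85 can land in the same category or in different ones, so the binning choice decides how much of that difference survives:

```python
import pandas as pd

scores = pd.Series([81, 85])

# Edges placed so both scores share a bin...
same = pd.cut(scores, bins=[0, 60, 90, 100], labels=['low', 'mid', 'high'])
# ...versus an edge between them that splits a 4-point difference into two classes.
split = pd.cut(scores, bins=[0, 60, 83, 100], labels=['low', 'mid', 'high'])

print(list(same))   # ['mid', 'mid']
print(list(split))  # ['mid', 'high']
```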
To review what was done during data preparation, the sleep score values are integers which range from 0-100 (mostly distributed around 60-90) and were binned into ordinal categorical scoring intervals to be used in classification. For this run the following was used:
This does several things to change our predictions:
Based on the results above we can see that regression gives us more interpretable and better results. Why is this true?
Interpretability
Regression is the better model for this task.
PLACEHOLDER BELOW:
Testing ordinal classification model.
from statsmodels.miscmodels.ordinal_model import OrderedModel
mod_log = OrderedModel(y_train['score'], X_train, distr='logit')
res_log = mod_log.fit(method='bfgs', disp=False)
res_log.summary()
| Dep. Variable: | score | Log-Likelihood: | -651.45 |
|---|---|---|---|
| Model: | OrderedModel | AIC: | 1423. |
| Method: | Maximum Likelihood | BIC: | 1641. |
| Date: | Wed, 28 Feb 2024 | | |
| Time: | 16:56:28 | | |
| No. Observations: | 281 | | |
| Df Residuals: | 221 | | |
| Df Model: | 18 | | |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| deep_sleep_duration | 0.0014 | 0.000 | 12.579 | 0.000 | 0.001 | 0.002 |
| light_sleep_duration | 0.0009 | 7.65e-05 | 12.114 | 0.000 | 0.001 | 0.001 |
| rem_sleep_duration | 0.0018 | 0.000 | 14.628 | 0.000 | 0.002 | 0.002 |
| restless_periods | -0.0165 | 0.003 | -5.402 | 0.000 | -0.022 | -0.011 |
| awake_time | -0.0004 | 5.42e-05 | -8.252 | 0.000 | -0.001 | -0.000 |
| average_breath | 0.0378 | 0.260 | 0.145 | 0.884 | -0.472 | 0.548 |
| average_hrv | 0.0065 | 0.005 | 1.321 | 0.187 | -0.003 | 0.016 |
| latency | -0.0013 | 0.000 | -10.031 | 0.000 | -0.002 | -0.001 |
| bedtime_start_delta | -0.0003 | 4.69e-05 | -7.279 | 0.000 | -0.000 | -0.000 |
| nap_today | 0.3044 | 0.391 | 0.779 | 0.436 | -0.462 | 1.071 |
| high_activity_time | 0.0007 | 0.001 | 0.938 | 0.348 | -0.001 | 0.002 |
| low_activity_time | -2.068e-05 | 6.59e-05 | -0.314 | 0.754 | -0.000 | 0.000 |
| medium_activity_time | 0.0001 | 8.83e-05 | 1.610 | 0.107 | -3.09e-05 | 0.000 |
| non_wear_time | 1.33e-05 | 6.58e-05 | 0.202 | 0.840 | -0.000 | 0.000 |
| resting_time | 4.963e-05 | 5.76e-05 | 0.861 | 0.389 | -6.33e-05 | 0.000 |
| sedentary_time | 2.717e-05 | 6.71e-05 | 0.405 | 0.686 | -0.000 | 0.000 |
| temperature_deviation | 0.2497 | 0.555 | 0.450 | 0.653 | -0.839 | 1.338 |
| spo2_percentage | 0.1491 | 0.133 | 1.122 | 0.262 | -0.111 | 0.409 |
| 42/44 | 23.7948 | 15.133 | 1.572 | 0.116 | -5.866 | 53.455 |
| 44/46 | -0.2703 | 0.996 | -0.271 | 0.786 | -2.223 | 1.682 |
| 46/48 | -0.1632 | 0.969 | -0.168 | 0.866 | -2.062 | 1.735 |
| 48/49 | 0.3219 | 0.853 | 0.377 | 0.706 | -1.350 | 1.993 |
| 49/53 | 0.2149 | 0.854 | 0.252 | 0.801 | -1.459 | 1.889 |
| 53/56 | 0.1616 | 0.846 | 0.191 | 0.848 | -1.496 | 1.820 |
| 56/57 | -0.0129 | 0.880 | -0.015 | 0.988 | -1.737 | 1.711 |
| 57/58 | 0.3788 | 0.508 | 0.745 | 0.456 | -0.618 | 1.375 |
| 58/59 | -1.0746 | 0.962 | -1.117 | 0.264 | -2.961 | 0.811 |
| 59/60 | -1.2145 | 0.965 | -1.259 | 0.208 | -3.106 | 0.677 |
| 60/61 | -0.6869 | 0.667 | -1.030 | 0.303 | -1.994 | 0.620 |
| 61/62 | -0.7884 | 0.668 | -1.181 | 0.238 | -2.097 | 0.520 |
| 62/63 | -0.3141 | 0.462 | -0.680 | 0.497 | -1.220 | 0.592 |
| 63/64 | -1.8364 | 0.982 | -1.871 | 0.061 | -3.761 | 0.088 |
| 64/65 | -0.8204 | 0.550 | -1.491 | 0.136 | -1.899 | 0.258 |
| 65/66 | -0.4113 | 0.382 | -1.076 | 0.282 | -1.161 | 0.338 |
| 66/67 | -1.6972 | 0.694 | -2.447 | 0.014 | -3.057 | -0.338 |
| 67/68 | -0.7551 | 0.391 | -1.932 | 0.053 | -1.521 | 0.011 |
| 68/69 | -0.4521 | 0.314 | -1.438 | 0.150 | -1.068 | 0.164 |
| 69/70 | -0.4253 | 0.312 | -1.363 | 0.173 | -1.037 | 0.186 |
| 70/71 | -0.5964 | 0.355 | -1.681 | 0.093 | -1.292 | 0.099 |
| 71/72 | -1.1116 | 0.480 | -2.316 | 0.021 | -2.052 | -0.171 |
| 72/73 | -0.1188 | 0.254 | -0.467 | 0.641 | -0.617 | 0.380 |
| 73/74 | -0.9134 | 0.363 | -2.514 | 0.012 | -1.626 | -0.201 |
| 74/75 | -0.5183 | 0.285 | -1.820 | 0.069 | -1.076 | 0.040 |
| 75/76 | -0.0228 | 0.210 | -0.109 | 0.913 | -0.434 | 0.388 |
| 76/77 | -0.4332 | 0.251 | -1.727 | 0.084 | -0.925 | 0.059 |
| 77/78 | -0.1928 | 0.213 | -0.907 | 0.364 | -0.609 | 0.224 |
| 78/79 | -0.3177 | 0.226 | -1.408 | 0.159 | -0.760 | 0.125 |
| 79/80 | -0.6815 | 0.274 | -2.487 | 0.013 | -1.218 | -0.144 |
| 80/81 | -0.5816 | 0.262 | -2.218 | 0.027 | -1.096 | -0.068 |
| 81/82 | -0.1008 | 0.212 | -0.476 | 0.634 | -0.516 | 0.314 |
| 82/83 | -0.4840 | 0.283 | -1.708 | 0.088 | -1.040 | 0.071 |
| 83/84 | -0.2246 | 0.257 | -0.872 | 0.383 | -0.729 | 0.280 |
| 84/85 | 0.1782 | 0.229 | 0.778 | 0.436 | -0.271 | 0.627 |
| 85/86 | -0.6495 | 0.426 | -1.525 | 0.127 | -1.484 | 0.185 |
| 86/87 | -0.1307 | 0.373 | -0.350 | 0.726 | -0.862 | 0.601 |
| 87/88 | -0.0134 | 0.370 | -0.036 | 0.971 | -0.739 | 0.712 |
| 88/89 | -1.1827 | 0.688 | -1.718 | 0.086 | -2.532 | 0.167 |
| 89/90 | 0.4771 | 0.349 | 1.368 | 0.171 | -0.207 | 1.161 |
| 90/92 | -0.7199 | 0.951 | -0.757 | 0.449 | -2.584 | 1.144 |
| 92/94 | 0.4429 | 0.645 | 0.687 | 0.492 | -0.821 | 1.706 |
y_scores = np.array(res_log.predict(X_test))
label_binarizer = LabelBinarizer().fit(y_train['score'])
y_onehot_test = label_binarizer.transform(y_test['score'])
fpr, tpr, _ = roc_curve(y_onehot_test.ravel(), y_scores.ravel())
auc_score = auc(fpr, tpr)
plt.plot(fpr, tpr, label = f"{str(res_log).partition('(')[0]} - AUC = {auc_score:0.2f}")
plt.plot(np.arange(0, 1.1, 0.1), np.arange(0, 1.1, 0.1), 'k--')
plt.title('ROC - Micro-Average OvR')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.grid(alpha = 0.4)
plt.legend(loc = 'lower right')
<matplotlib.legend.Legend at 0x29a1d1510>
sorted(y_train['score'].unique())
[42, 44, 46, 48, 49, 53, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 92, 94]
# results_ord = []
# for i in y_scores:
# results_ord.append(labels_custom[i.argmax()])
results_ord = []
for i in y_scores:
results_ord.append(sorted(y_train['score'].unique())[i.argmax()])
# (results_ord == y_test_bin).sum() / len(y_test_bin)
(results_ord == y_test['score']).sum() / len(y_test['score'])
0.16901408450704225
# f1_score(y_test_bin, results_ord, average = 'weighted')
f1_score(y_test['score'], results_ord, average = 'weighted')
0.11501737698920797
# MAE
abs(results_ord - y_test['score']).sum() / len(y_test['score'])
2.3661971830985915
# RMSE
np.sqrt(((results_ord - y_test['score'])**2).sum() / len(y_test['score']))
2.969326760134476
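One possible refinement, sketched on a toy probability matrix rather than res_log's actual output, is to replace the argmax with the probability-weighted expected score, which uses the whole predicted distribution and often lowers MAE for ordinal targets:

```python
import numpy as np

# Toy: 2 test rows, predicted probabilities over 4 ordered score classes.
class_values = np.array([60, 70, 80, 90])
proba = np.array([[0.1, 0.2, 0.5, 0.2],
                  [0.4, 0.4, 0.1, 0.1]])

argmax_pred = class_values[proba.argmax(axis=1)]  # mode of each distribution
expected_pred = proba @ class_values              # probability-weighted mean

print(argmax_pred)    # [80 60]
print(expected_pred)  # ~[78. 69.]
```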
# cm = confusion_matrix(y_test_bin, results_ord, labels = y_test_bin.sort_values().unique())
# disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = y_test_bin.sort_values().unique())
# disp.plot(xticks_rotation = 90)
cm = confusion_matrix(y_test['score'], results_ord, labels = y_test['score'].sort_values().unique())
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels = y_test['score'].sort_values().unique())
disp.plot(xticks_rotation = 90)
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x29a281b10>
# Exported via command line using:
# HTML
#jupyter nbconvert Super_Sleep_Notebook.ipynb --to html
#jupyter nbconvert Super_Sleep_Notebook.ipynb --to html --HTMLExporter.theme=dark
# PDF
#jupyter nbconvert Super_Sleep_Notebook.ipynb --to webpdf
#jupyter nbconvert Super_Sleep_Notebook.ipynb --to webpdf --HTMLExporter.theme=dark